This accompaniment documentation is part of the final assessment in module CASA0005: Geographic Information Systems and Science (19/20)

Overview

Since 2014, all the big retailers in Israel are obliged by law to publish on an hourly basis a full data set of stores and prices of all the products that are being sold in each store. According to the Israeli anti-trust authority, the intention of this act is to mitigate the cost of living in Israel and to encourage competition between different retailers, supermarkets and pharmacies.

This data set allows us to examine the relationship between location and product pricing in the retail sector, and even detect geographic monopolies, where a single retailer provides product or service to a local area.

The Research Question

This paper addresses and examines the following question: What is the impact of the distance from a retailer to their competitors on product pricing? The hypothesis which is being examined accordingly is that the pricing gets higher as the retailer is farther from their competitors and vice versa.

Methodology

A common practice of defining catchment areas of stores is based on network analysis of roads and other transportation infrastructure. This work takes a different approach, and suggests using Voronoi tessellation as a simpler interpolation for this purpose. Voronoi tessellation builds a polygon around each point in a data set, this polygon covers all the points that are the closest to the point seed rather than to any other point in the data set.

This analysis compares the prices of a single package of 500g raw sesame Tahini, a popular middle eastern condiment, between 68 different stores in central Tel Aviv, Israel. The analysis was carried out in the R environment for statistical computing and visualisation.

The following flowchart summarises the three steps of the process: data preparation, spatial analysis and descriptive and inferential statistics:

Flowchart of analysis process

Data Preparation

First, let’s load our packages

Specifying Hebrew as input language

## [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"

Creating an Area of Interest Polygon

We will be focusing in our analysis on central Tel Aviv, the economic centre of Israel. First, we need to download the quarters shapefile which is available on the municipal QIS platform (Layers > Zones Boundaries > Quarters > Download):

Once we have downloaded the shapefile, let’s create a single polygon which represents the city centre: quarters 3,4,5 and 6:

Now let’s check that we have the right quarters and plot these two sf objects together:

Quarters of Tel Aviv-Jaffa, city centre is highlighted in red

Out area of interest is shown in red, with four different quarters. Let’s use the st_combine function to create a single polygon:

City centre

Great, now we have a single sp object with a single polygon for the city centre.

Creating a Stores Data Set

We’ll be using an API service to find the attributes of each store in Tel Aviv-Jaffa. This service requires an API key, which can be obtained from here (in Hebrew).

   store_id           chain_id         sub_chain_id         city_id           store_area         store_type         store_name       
 Length:160         Length:160         Length:160         Length:160         Length:160         Length:160         Length:160        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
  store_code        store_address      store_zip_code     store_gps_lat      store_gps_lng      store_update_time  store_insert_time 
 Length:160         Length:160         Length:160         Length:160         Length:160         Length:160         Length:160        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
  city_name          city_name2         city_name3        city_gps_lat       city_gps_lng        chain_name         chain_code       
 Length:160         Length:160         Length:160         Length:160         Length:160         Length:160         Length:160        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 chain_image        sub_chain_name     sub_chain_code     sub_chain_image    sub_chain_is_pharm
 Length:160         Length:160         Length:160         Length:160         Length:160        
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  

Now we need to get rid of pharmacies and online stores, using the filter function within the dplyr package:

Now we need to fix manually wrong or missing XY data of some of the stores:

Since the chain names are in Hebrew, it might be beneficial to add a new column with English chain names. First, let’s create a factor with unique chain names in Hebrew:

 [1] "שופרסל"              "אושר עד"             "דור אלון"            "טיב טעם"             "מחסני להב"           "ויקטורי"            
 [7] "פרש מרקט"            "רמי לוי שיווק השקמה" "מ. יוחננוף ובניו"    "יינות ביתן"          "מגה"                 "עדן טבע"  

Now we can create a new factor with English names, create a new data frame as a dictionary, and then add a new column to filteredStores with English chain names:

Now we need to convert our filteredStores data frame to a sp object:

Quarters of Tel Aviv-Jaffa and Store Locations

Creating a Price Data Set

Now it’s our time to find the price of Tahini in each store in the city centre. First, we need to create a new sp object for all the relevant stores:

Now, let’s create a new factor that holds all the relevant store IDs, and an empty factor that will hold the price of Tahini in each store:

Now we need to specify the parameters for our API call:

  • url - hasn’t changed
  • api_key - hasn’t changed
  • action is set to GetPriceByProductBarCode
  • product_barcode is set to "7290000112220" (The product barcode for “Tahini Achva”, a popular Tahini brand)
  • store_id is set to the ith element of the storeIDs factor. We will iterate over this factor using a for loop

We’re all set! Now we can run a for loop that actually sends a single API call for each store, and then pushes the Tahini price in that store into the prices factor. If this product isn’t available in that store, an NA value is stored instead.

Now we can create a new data frame with two columns:

  • StoreID
  • Price

Creating a Buildings Data Set

Our last data preparation is to find all the buildings within our area of interest. This time we’ll use a different API service, https://osmbuildings.org/, which provides access to buildings polygons based on Open Street Map data. Due to the fact this service provides a single 15-zoom tile per call, we need to find all the relevant tiles in our area of interest.

First, let’s find the bounding box of our area of interest, using the st_bbox function in sf package:

    xmin     ymin     xmax     ymax 
34.75689 32.05687 34.80976 32.10244 

Now we can use the bbox_to_tile_grid function in slippymath package to create a new data frame with all the relevant tiles for our area of interest. Each tile is specified by a combination of x and y values:

      x     y
1 19547 13295
2 19548 13295
3 19549 13295
4 19550 13295
5 19551 13295
6 19552 13295
[1] 36

We will be using a for loop to iterate over the tiles data frame and get all the buildings within the area of interest. Lets try to get all the buildings within a single tile, x=19550, y=13298. We’ll use the readOGR function in the rgdal package to convert a geojson to sp. The ducomentation for API calls in OSM Buildings is available here.

Buildings within a single tile

Now we can clean the Buildings object and run a for loop that iterates over the tiles object to get a full coverage of the area of interest. We’ll use the sprintf function to format the API call with the right XY values for each tile. We’ll also use the duplicated function to eliminate duplicated buildings which are included in more than one API response:

Buildings within the area of interest

Data Analysis

In this analysis, we would like to know if there is any correlation between the catchment area of each store and the price of Tahini in each store within the area of interest. In order to find the catchment area of each store, we will use a special method called voronoi tessellation, which creates a single polygon for each store. Each polygon covers all the urban area that is closer to one specific store than to any other store.

Voronoi Tessellation

To run this procedure, we’ll user the voronoi function in the dismo package, with the filteredStoressp object as input.

Voronoi diagram

It can clearly be seen that “Mega” has the largest catchment area, but we must acknowledge the limitations of using voronoi tessellation: the outer polygons are the most prone to distortion, due to the fact that we hadn’t included any stores outside of Tel Aviv-Jaffa. These stores, in fact, affecting the size of the catchment areas of stores within Tel Aviv - because they might serve people who live or work in Tel Aviv.

In order to mitigate this limitation, the analysis includes only stores within the area of interest:

Filtered Voronoi polygons with AOI in red

There are two main problems with this selected set of polygons:

  1. Some of the polygons are stretching far beyond the area of interest
  2. In some cases, two neighbouring polygons share the same chain, this fact makes them a single catchment area

In order to this issue, we’ll first dissolve all the polygons that share the same chain value, and then split the merged layer back into different polygons. Then, due to the fact that some of the polygons are actually covering the Mediterranean Sea, we’ll count and sum the built area within each catchment area.

Converting Voronoi Polygons to Catchment Areas

First, we need to make sure that filteredVorCityCentre is in the right projection, WGS84. Then, we’ll merge adjacent voronoi polygons who share the same chain, using the gUnaryUnion function in the rgeos package. This procedure returns an object without any attribute data, so we’ll have to restore the chain value for each merged polygon.

Now we can split the vorWGS84Union layer into single polygons using the disaggregate function, and then we give a unique identifier for each catchment area.

Catchment areas by chain code

Summing Built Area per Catchment Area

Once we have a layer of catchment areas with chain code values, we can go further and find the sum of the built area in each catchment area.

Cropping the Buildings Layer

First, let’s crop the buildings layer and pick only the buildings inside our area of interest, using the intersect function in the raster package. Then we need to make sure that the layer is in the right projection, WGS84:

Now, using the intersect function, we can intersect the buildings layer with the voronoi layer. Each polygon in our new layer buildingsVoronoi will have the attributes of both input layers. Then, we’ll calculate the area of each polygon in square meters using the area function.

Buildings in area of interest by catchment chain ID

Summarising Built Area per Catchment Area

The next step is to sum up the built area for each catchment area by using the aggregate, the output is a single data frame, sumArea, with two columns: catchment_id and catchment_sum_area:

  catchment_id catchment_sum_area
1            1           4736.143
2            2          82949.349
3            3         172744.090
4            4          50512.904
5            5          47038.235
6            6         127774.424

Now we can do another tabular join and create a new column in the vorWGS84UnionDis object with the size of built area within each catchment area:

Aggregating Data into a Single Store Data Frame

The final step is to aggregate the data into a single store data frame, in this data frame each record is a store within the area of interest with a value of built area (in sq meter) of the catchment area that the store falls into. We’ll use a spatial join in order to match each store to a single catchment area:

  1. First, we’ll extract all the stores in the city centre
  2. Using a tabular join, we’ll add to the new store data set a new column with the price of Tahini (in ILS) in each store
  3. Then we’ll convert the catchment area polygons and the store points to sf to preform a spatial join using the st_as_sf function
  4. The next step is to spatially join these two sf objects, using the st_join function
  5. Then we’ll drop any stores with missing price value
  6. Finally, we’ll create another tabular join in order to add the English chain names for each store, and then export the data to external data frame

Descriptive and Inferential Statistics

Once we have a single data frame, sjoinCleanDF which aggregates the data of our spatial analysis, we can dive into our statistical analysis.

Descriptive Statistics

sjoinCleanDF is consisted of 68 records and 33 variables. By creating a simple pivot table, we can understand our data better. Let’s summarise our data by chain name, with the following statistics:

  1. Mean Tahini price by chain
  2. Store count by chain
  3. Mean size of built area in catchment area by chain
  4. Sum of all built area in catchment areas by chain

WE’ll use the dplyr package and the functions select, group_by and summarise to create the pivot table.

chainNameEng meanPrice storeCount meanCatchmentArea sumCatchmentArea
Dor Alon 16.98 26 364,790.11 9,484,542.81
Tiv Taam 16.45 11 165,039.89 1,815,438.77
Shufersal 15.17 14 173,797.47 2,433,164.59
Eden Teva 14.90 1 61,748.04 61,748.04
Victory 14.20 4 36,411.64 145,646.57
Mega 11.33 11 469,886.63 5,168,752.93
Fresh Market 10.90 1 14,848.00 14,848.00

plot_ly package is useful for simple visualisations, like the following pie chart:

Stores within Area of Interest by Chain

We can learn from this table that Dor Alon is the priciest chain with mean price of 16.98 NIS per single package of Tahini. This chain is also the chain with the highest number of stores within the area of interest - 26 stores. The highest mean built area within a single catchment area belongs to Mega (470 thousand square meters), but the highest total sum of built area within all the catchment areas belongs to Dor Alon as well, almost 9.5 million square meters. These findings might be an evidence to a correlation between the size of built area within catchment areas and the price of Tahini in each store.

Let’s continue with some summary statistics and basic visualisation for each of the following variables:

  1. Tahini Price (pricesFloat) - dependent variable
  2. Size of store’s catchment area (catchment_sum_area) - independent varialbe

Tahini Price

We’ll use the stat.desc function in the pastecs package to extract descriptive statistics:

x
nbr.val 68.00
nbr.null 0.00
nbr.na 0.00
min 10.00
max 18.90
range 8.90
sum 1,041.90
median 16.90
mean 15.32
SE.mean 0.32
CI.mean.0.95 0.64
var 6.93
std.dev 2.63
coef.var 0.17

plot_ly package allows us to create a histogram with price distribution:

Histogram of Tahini Prices within Area of Interest

Size of store’s catchment area

x
nbr.val 68.00
nbr.null 0.00
nbr.na 0.00
min 14,848.00
max 713,576.21
range 698,728.21
sum 19,124,141.73
median 172,744.09
mean 281,237.38
SE.mean 30,494.28
CI.mean.0.95 60,866.84
var 63,233,272,984.51
std.dev 251,462.27
coef.var 0.89
Histogram of Built Area within a Single Catchment Area

Pearson Correlation and Linear Regression

As mentioned above, the null hypothesis is that there is no linear association between the built area within the catchment area of each store, and the price of Tahini in that store. To reject this hypothesis, Pearson correlation coefficient and a simple linear regression were calculated.

Pearson Correlation Coefficient

We’ll use the cor.test function to find the Pearson correlation for these two variables:

    Pearson's product-moment correlation

data:  sjoinCleanDF$pricesFloat and sjoinCleanDF$catchment_sum_area
t = 0.21397, df = 66, p-value = 0.8312
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2134361  0.2631023
sample estimates:
      cor 
0.0263288 

The built area within the catchment area and the price of Tahini in each store were not found to be significantly correlated, r(66) = .03, p = .83, for significance level of 5%.

Simple Linear Regression

A simple linear regression was calculated to predict the price of Tahini in each store based on the size of built area within the catchment area which the store falls into.

We’ll use the lm function to create a linear regression model, and the plot_ly package to create a scatter plot with a trend line.

Call:
lm(formula = sjoinCleanDF$pricesFloat ~ sjoinCleanDF$catchment_sum_area)

Residuals:
   Min     1Q Median     3Q    Max 
-5.418 -1.383  1.560  1.639  3.459 

Coefficients:
                                     Estimate    Std. Error t value            Pr(>|t|)    
(Intercept)                     15.2445243931  0.4844950730  31.465 <0.0000000000000002 ***
sjoinCleanDF$catchment_sum_area  0.0000002757  0.0000012885   0.214               0.831    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.652 on 66 degrees of freedom
Multiple R-squared:  0.0006932, Adjusted R-squared:  -0.01445 
F-statistic: 0.04578 on 1 and 66 DF,  p-value: 0.8312
Scatter Plot: price of Tahini in each store based on the size of built area within the catchment area which the store falls into

As expected, no significant regression equation was found (F(1,66) = .05, p = .83), with R2 = 0.

Conclusion

No evidence of association between the size of built area within catchment areas and Tahini prices in different stores in central Tel Aviv was found in this analysis. This isn’t to say that this association does not exist: since the store data set contains only retailers that are being regulated by the government (mainly big retailers), this analysis actually ignores little grocery stores that are probably affecting the real size of each catchment area. In addition, this analysis considers only one product within small geographic area - hence there is room for a wider investigation of bigger regions and wider range of products.

Another potential analysis might use the total built area within catchment areas by chain and local authority - with an updated hypothesis of correlation between the mean price of a product by retailer and local authority, and the total built area within the catchment areas of this retailer in each local authority.

Appendix: How to Make Middle Eastern Tahini?

Use this quick recipe to make you own delicious homemade Tahini. It’s super easy, and super healthy too. Enjoy!