ECON6027 Spacial Econometrics and Data Analysis Notes

Author

Rachel Kheng

1 Introduction to Spatial Data

1.1 What is Spatial Data?

Spatial data consists of positional information, answering the question “where is it” (on Earth, body, Sun, moon etc). Many empirical data contain not only information about the attribute of interest (i.e. the response/variable being studied), but also other variables that denote the geographic location where the response was observed. In particular, spatial data has a spatial reference: they have (i) coordinates and (ii) a system of reference for those coordinates (aka a coordinate reference system, CRS).

Examples of spatial data
  • Locations of volcano peaks on Earth
    • We could list the coordinates for all known volcanoes as pairs of longitude and latitude decimal degree with respect to the prime meridian of Greenwich and zero latitude at the equator (known as the World Geodetic System - WGS84).
  • Temperature at various residential towns around Singapore
    • We can measure the temperature for these areas around Singapore and make use of the CRS - SVY21.
  • Dengue clusters around Singapore *Similarly, we can collect data on dengue clusters around Singapore and make use of CRS - SVY21.

To verify if the CRS used in the dataset is accurate, we can refer to the Coordinate Systems Worldwide - https://epsg.io/.

1.2 Key Features of Spatial Data

One of the key features of spatial data is the auto-correlation (i.e. correlation with itself) of observations of observations in space. Observations in close spatial proximity tend to be more similar than for observations that are more spatially separated. We can understand that \(Cor({X_1},{X_2})\) ranges from \(-1\) to \(1\) and that correlation increases as \(X_1\) and \(X_2\) get closer and closer to each other.

Essentially, spatial data analysis developed concurrently across many disciplines, including geology, geography, and meteorology, as well as other non-traditional statistical areas. This led to the development of various methodologies, from traditional statistical techniques like linear models, response surface theory, and spatial statistics to spatial econometric approaches. The primary challenge today is to effectively combine these diverse tools to suit our specific analytical needs.

1.3 Types of Spatial Data: Theoretical Classification

We denote a spatial process in \(D\) dimensions as: \({Z(s):s\in D \subset R^d}\) where \(Z\) is the observed attribute of location s, a \((D \times 1)\) vector of coordinates. The spatial data types are distinguished through characteristics of the domain \(D\).

1.3.1 Geospatial / Geostatistical / Earth Data

Domain \(D\) is continuous s.t. \(Z(s)\) can be observed anywhere in \(D\). This means that between any two sample locations, we can theoretically place an infinite number of other samples. For example, I can measure temperature at multiple locations in the same classroom.

Examples of Geostatistical Data
  • For example, consider measuring air temperature or PSI value, which at least in theory, can be recorded at any location.
  • However, we will only observe a finite number of observations (e.g. only taking temperature at Dhoby Ghaut, which is 1 location, making it one sample).
    • We would have monitoring stations at some locations and use that data to reconstruct the ozone layer, as compared to taking measurements across the entire region, which is costly and operationally impossible.
  • Other examples include ozone layer concentration of a certain material, ground communication levels etc.

Due to the continuity of \(D\), geostatistical data is also known as spatial data with continuous variation. Since the spatial domain is continuous, it cannot be sampled exhaustively and an important task in the analysis is the reconstruction of the surface of the attribute \(Z\) over the entire domain, i.e. mapping of \(Z(s)\).

Important

Continuity is associated with the domain and not the attribute itself (which may be discrete or continuous or even nominal or ordinal). For example, temperature can be measured using a Celsius scale (continuous) or an ordinal scale (discrete).

1.3.2 Point Patterns

A point pattern has a random domain and the attribute itself will be Degenerate/binary in nature (i.e. it is either there or not there). If along with the location of an event, if we observe a stochastic attribute, then it is called a marked pattern.

Example of Point Patterns
  • Point Patterns:
    • Locations of lightning strikes
    • Locations at which weed emerge in a garden
    • Location of lunar craters
  • Marked Patterns:
    • Observing the size of the lunar crater along with the location

1.3.3 Areal / Lattice / Regional Data

2 Introduction to R

2.1 Install ‘sf’ package

When we start up our project, we will need to make sure we install and load the necessary packages. The ‘sf’ package should already be installed so we just need to load the ‘sf’ package from our library using the following code:

is.element('sf', installed.packages())
[1] TRUE
library(sf)
Linking to GEOS 3.13.1, GDAL 3.11.0, PROJ 9.6.0; sf_use_s2() is TRUE

2.2 Reading a shapefile and geojson to R

When we load the dataset, we will need to use the st_read() command so as to read the file to R. This applies for both shapefiles and geojson files.

2.2.1 Reading shapefile

sg_pa = st_read("C:/Users/rache/Downloads/MasterPlan2014SubzoneBoundaryWebSHP")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\Users\rache\Downloads\MasterPlan2014SubzoneBoundaryWebSHP' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

2.2.2 Reading geojson file

sg_spt = st_read("C:/Users/rache/Downloads/SportSGSportFacilitiesGEOJSON.geojson")
Reading layer `SportSGSportFacilitiesGEOJSON' from data source 
  `C:\Users\rache\Downloads\SportSGSportFacilitiesGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 35 features and 2 fields
Geometry type: POLYGON
Dimension:     XYZ
Bounding box:  xmin: 103.6932 ymin: 1.285977 xmax: 103.9526 ymax: 1.435855
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
How do I ensure R reads my files properly?

In case that your data set is not in the same folder as the R Markdown file, use the file path instead to ensure that the file is read properly in R.

2.3 Inspecting our shapefile or geojson file

Now that we have loaded our dataset, we will need to inspect the data as we will need to make sure that there are no false entries. These false entries could be due to overlapping polygons. As a habit, we should always check the validity especially if the dataset is downloaded from the Internet.

Important

Always validate the dataset before we do anything else!

class(sg_pa)
[1] "sf"         "data.frame"
summary(sg_pa)
    OBJECTID       SUBZONE_NO      SUBZONE_N          SUBZONE_C        
 Min.   :  1.0   Min.   : 1.000   Length:323         Length:323        
 1st Qu.: 81.5   1st Qu.: 2.000   Class :character   Class :character  
 Median :162.0   Median : 4.000   Mode  :character   Mode  :character  
 Mean   :162.0   Mean   : 4.625                                        
 3rd Qu.:242.5   3rd Qu.: 6.500                                        
 Max.   :323.0   Max.   :17.000                                        
    CA_IND           PLN_AREA_N         PLN_AREA_C          REGION_N        
 Length:323         Length:323         Length:323         Length:323        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   REGION_C           INC_CRC            FMEL_UPD_D             X_ADDR     
 Length:323         Length:323         Min.   :2014-12-05   Min.   : 5093  
 Class :character   Class :character   1st Qu.:2014-12-05   1st Qu.:21864  
 Mode  :character   Mode  :character   Median :2014-12-05   Median :28465  
                                       Mean   :2014-12-05   Mean   :27257  
                                       3rd Qu.:2014-12-05   3rd Qu.:31674  
                                       Max.   :2014-12-05   Max.   :50425  
     Y_ADDR        SHAPE_Leng        SHAPE_Area                geometry  
 Min.   :19579   Min.   :  871.6   Min.   :   39438   MULTIPOLYGON :323  
 1st Qu.:31776   1st Qu.: 3709.6   1st Qu.:  628261   epsg:NA      :  0  
 Median :35113   Median : 5211.9   Median : 1229894   +proj=tmer...:  0  
 Mean   :36106   Mean   : 6524.4   Mean   : 2420882                      
 3rd Qu.:39869   3rd Qu.: 6942.6   3rd Qu.: 2106483                      
 Max.   :49553   Max.   :68083.9   Max.   :69748299                      
head(sg_pa)
Simple feature collection with 6 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 24468.89 ymin: 28369.47 xmax: 32362.39 ymax: 30542.74
Projected CRS: SVY21
  OBJECTID SUBZONE_NO      SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
1        1          1   MARINA SOUTH    MSSZ01      Y    MARINA SOUTH
2        2          1   PEARL'S HILL    OTSZ01      Y          OUTRAM
3        3          3      BOAT QUAY    SRSZ03      Y SINGAPORE RIVER
4        4          8 HENDERSON HILL    BMSZ08      N     BUKIT MERAH
5        5          3        REDHILL    BMSZ03      N     BUKIT MERAH
6        6          7 ALEXANDRA HILL    BMSZ07      N     BUKIT MERAH
  PLN_AREA_C       REGION_N REGION_C          INC_CRC FMEL_UPD_D   X_ADDR
1         MS CENTRAL REGION       CR 5ED7EB253F99252E 2014-12-05 31595.84
2         OT CENTRAL REGION       CR 8C7149B9EB32EEFC 2014-12-05 28679.06
3         SR CENTRAL REGION       CR C35FEFF02B13E0E5 2014-12-05 29654.96
4         BM CENTRAL REGION       CR 3775D82C5DDBEFBD 2014-12-05 26782.83
5         BM CENTRAL REGION       CR 85D9ABEF0A40678F 2014-12-05 26201.96
6         BM CENTRAL REGION       CR 9D286521EF5E3B59 2014-12-05 25358.82
    Y_ADDR SHAPE_Leng SHAPE_Area                       geometry
1 29220.19   5267.381  1630379.3 MULTIPOLYGON (((31495.56 30...
2 29782.05   3506.107   559816.2 MULTIPOLYGON (((29092.28 30...
3 29974.66   1740.926   160807.5 MULTIPOLYGON (((29932.33 29...
4 29933.77   3313.625   595428.9 MULTIPOLYGON (((27131.28 30...
5 30005.70   2825.594   387429.4 MULTIPOLYGON (((26451.03 30...
6 29991.38   4428.913  1030378.8 MULTIPOLYGON (((25899.7 297...
table(st_is_valid(sg_pa))

FALSE  TRUE 
    9   314 
What do the commands mean?
  • class(sg_pa): Used to determine the class of the object, e.g., data.frame.
  • summary(sg_pa): Calculates a 6-number summary including the minimum value, first quartile (25th percentile), median (50th percentile), mean, third quartile (75th percentile), and maximum value.
  • head(sg_pa): Inspects the first 6 rows of the dataset.
  • table(st_is_valid(sg_pa)): Creates a table for the object and checks the validity of the object.

2.3.1 Validating objects due to false entries

If there are false entries, use the command st_make_valid() to validate the object.

sg_pa = st_make_valid(sg_pa)
table(st_is_valid(sg_pa))

TRUE 
 323 
What do the commands mean?
  • sg_pa = st_make_valid(sg_pa): Used for making the object valid.
  • table(st_is_valid(sg_pa)): Used for creating a table from the newly validated object.

2.4 Visualize the shapefile and geojson file

Now, we can visualize our dataset in the form of maps by using the plot() command. If there are many attributes in the dataset, we can use the max.plot() command together with the plot() command to limit the number of maps generated. If we wish to plot a map using a specific attribute, then we need to specify the attribute in the plot() command.

2.4.1 Shapefile

plot(sg_pa, max.plot = 12)

plot(sg_pa["PLN_AREA_N"])

What do the commands mean?
  • plot(sg_pa, max.plot = 12): Plots our object; the max.plot function generates the number of maps specified (e.g., 12).
  • plot(sg_pa["PLN_AREA_N"]): Plots our object based on the specified attribute we want (e.g., PLN_AREA_N).

2.4.2 geojson file

Before we can plot our dataset, we will need to extract the geometry for our geojson file. This is so that we can seoarate the spatial data freom other attribute information, allowing us to visualize, process, and interact with the geographic shapes (points, lines, polygons) using R. This process is necessary because geometry is what defines the map features, and extracting it provides a clean dataset for tasks like rendering maps, performing spatial analysis, or exchanging data between different platforms.

st_geometry(sg_spt)
Geometry set for 35 features 
Geometry type: POLYGON
Dimension:     XYZ
Bounding box:  xmin: 103.6932 ymin: 1.285977 xmax: 103.9526 ymax: 1.435855
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
First 5 geometries:
POLYGON Z ((103.7628 1.310547 0, 103.7628 1.310...
POLYGON Z ((103.7644 1.311377 0, 103.7646 1.311...
POLYGON Z ((103.6937 1.33894 0, 103.6937 1.3389...
POLYGON Z ((103.8723 1.323252 0, 103.8723 1.323...
POLYGON Z ((103.8774 1.306804 0, 103.8787 1.306...
What do the commands mean?
  • st_geometry(sg_spt): Extracts the geometry.

After extracting the geometry, we can now plot the spatial data. To add colour to our plots, we can make use of the command col = “colour” as well as border = “colour” so that the plots are filled in colour. Additionally, the command main = “title” will create a title for the plot.

plot(sg_spt$geometry, col = "green", border = "darkblue", main="SportSG Sport Facilities Locations")

Lastly, to load the plot of the dengue clusters on the Singapore map, we can make use of the library tmap as well as the command tm_shape(), which specifies the spatial data object or “shape” that serves as the base layer for creating a thematic map. Essentially, this defines the spatial context for subsequent map layers. Then, we use the tm_polygons() command to draw polygons on top of the map layer. Finally, to view the map that we have created, we use the command tmap_mode() where we can set the mode to (i) plot, which creates default and static maps, while (ii) view creates interactive maps that can be zoomed in and out, allow for changing background tiles (basemaps), or click on map objects to get some additional information.

library(tmap)
sploc_map = tm_shape(sg_spt$geometry) + tm_polygons(col = "green")
tmap_mode("view")
ℹ tmap mode set to "view".
sploc_map
What do the commands mean?
  • tmap package is used for creating thematic maps.
  • The following commands create a base map and an interactive view:
    • tm_shape: Specifies the spatial data object for the base layer and creates polygons on top of the map layer.
    • tmap_mode: Sets the created map to be an interactive map.