1 Installing and Loading Required Packages

In this in-class exercise, three R packages will be used:

  • spatstat, which has a wide range of useful functions for point pattern analysis.
  • raster, which reads, writes, manipulates, analyses and models gridded spatial data (i.e. raster data).
  • maptools, which provides a set of tools for manipulating geographic data. In this hands-on exercise, we mainly use it to convert Spatial objects into spatstat’s ppp format.
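A minimal loading chunk might look like the one below (a sketch; tmap is also loaded here for mapping, as the warnings that follow suggest):

    library(maptools)  # tools to convert Spatial objects to ppp
    library(spatstat)  # point pattern analysis
    library(raster)    # gridded (raster) data handling
    library(tmap)      # thematic mapping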
## Warning: package 'maptools' was built under R version 4.0.2
## Warning: package 'spatstat' was built under R version 4.0.2
## Warning: package 'spatstat.data' was built under R version 4.0.2
## Warning: package 'nlme' was built under R version 4.0.2
## Warning: package 'tmap' was built under R version 4.0.2

2 Spatial Data Wrangling

2.1 Importing the spatial data

In this section, readOGR() of the rgdal package will be used to import the three geospatial data sets into R as SpatialPolygonsDataFrame or SpatialPointsDataFrame objects. These data are not in simple features format; rather, they are in sp’s Spatial classes.
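A sketch of the import calls that produce the output below (the dsn path is abbreviated to "data" here; the object names childcare, sg and mpsz are assumptions):

    library(rgdal)
    childcare <- readOGR(dsn = "data", layer = "CHILDCARE")
    sg        <- readOGR(dsn = "data", layer = "CostalOutline")
    mpsz      <- readOGR(dsn = "data", layer = "MP14_SUBZONE_WEB_PL")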

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "CHILDCARE"
## with 1312 features
## It has 18 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "CostalOutline"
## with 60 features
## It has 4 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields

We can plot these geospatial data in one plot by using the code chunk below.
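One possible plotting chunk (a sketch, using the object names assumed above):

    plot(mpsz, border = "lightgrey")       # subzone boundaries as the base layer
    plot(childcare, add = TRUE, pch = 20)  # overlay the childcare points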

We can also prepare an interactive pin map by using the code chunk below.
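A sketch of the interactive pin map; switching back to static mode afterwards produces the message below:

    tmap_mode("view")                # interactive (leaflet-based) mode
    tm_shape(childcare) + tm_dots()  # pin map of childcare centres
    tmap_mode("plot")                # back to static plotting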

## tmap mode set to plotting
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

2.2 Converting the spatial point data frame into generic sp format

spatstat requires the analytical data to be in ppp object form. There is no direct way to convert a Spatial*DataFrame into a ppp object; we need to convert it into a generic Spatial object first.

The code below converts the SpatialPointsDataFrame and SpatialPolygonsDataFrame objects into generic SpatialPoints and SpatialPolygons objects.
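A sketch of the conversions (object names assumed):

    childcare_sp <- as(childcare, "SpatialPoints")
    sg_sp        <- as(sg, "SpatialPolygons")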

What are the differences between a SpatialPoints object and a SpatialPointsDataFrame object?

We see that childcare_sp does not have the attribute data frame found in the SpatialPointsDataFrame object. This is because the data frame is not required: we are only interested in the X/Y coordinates of the SpatialPoints.

2.3 Converting the generic sp format into spatstat’s ppp format

Now, we will use the as.ppp() function of spatstat to convert the spatial data into spatstat’s ppp object format.
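One possible call, whose printed summary appears below (maptools supplies the coercion method; names assumed):

    childcare_ppp <- as(childcare_sp, "ppp")
    childcare_ppp  # prints the planar point pattern summary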

## Planar point pattern: 1312 points
## window: rectangle = [11203.01, 45404.24] x [25667.6, 49300.88] units

Now, let us plot childcare_ppp and examine the difference.

The code chunk below implements the jittering approach.
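A sketch of the jittering call, following the hands-on exercise (object names assumed):

    childcare_ppp_jit <- rjitter(childcare_ppp,
                                 retry = TRUE,  # keep jittered points inside the window
                                 nsim = 1,
                                 drop = TRUE)
    any(duplicated(childcare_ppp_jit))  # should now be FALSE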

2.4 Creating owin

NOTE: The previous parts of the exercise are skipped.

To create an owin, the input must be in Polygon form.

The code chunk below is used to convert the sg SpatialPolygons object into an owin object of spatstat.
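A sketch of the conversion (maptools supplies this coercion; names assumed):

    sg_owin <- as(sg_sp, "owin")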

The output object can be displayed by using the plot() function.

2.5 Combining childcare points and the study area

By using the code below, we are able to extract the childcare points that fall within a specific region, for use in our analysis later on.
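A sketch of the extraction using spatstat’s subsetting, shown here with the Singapore-wide window (a planning-area owin such as Punggol would be used the same way; names assumed):

    childcareSG_ppp <- childcare_ppp_jit[sg_owin]  # keep only points inside the window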

Here we plot the combined childcare points and the Punggol region to show that the extraction works.

NOTE: Instead of the points being bounded by a rectangle, we now bound the study area with the area of Singapore.

3 Kernel Density Estimation

3.1 Four Key Things

  • Input is the data, followed by 4 important parameters (see the sketch after this list):
    • Bandwidth - “sigma=bw.diggle” computes the bandwidth automatically. Other possible methods are “bw.CvL()”, etc. They are standalone functions, so we can also call them directly with the appropriate arguments.
    • Edge - whether to apply edge correction to compensate for bias near the study-area boundary.
    • Kernel Type - the smoothing method. Instead of “gaussian”, there are also other methods; decide which kernel to use with the help of the Lesson 4 slides. If we know the distance is fixed and there is no demand beyond that point, we can use the uniform kernel function, but this is very unlikely: for example, people may still be willing to travel long distances for a better-known childcare centre. Remember that there is no one fixed answer; use the appropriate kernel function.
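A sketch wiring these parameters together, plus the standalone bandwidth call whose sigma appears below (object names assumed):

    bw.diggle(childcareSG_ppp)  # standalone bandwidth computation; prints sigma
    kde_childcareSG_bw <- density(childcareSG_ppp,
                                  sigma = bw.diggle,    # automatic bandwidth selection
                                  edge = TRUE,          # apply edge correction
                                  kernel = "gaussian")  # smoothing kernel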
##    sigma 
## 290.1203

Prof’s exercise shows a sigma of 298.4095. The difference could be because some parts are skipped in this notebook.
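The anisotropic result below presumably comes from a call like this (a sketch; object name assumed):

    bw.scott(childcareSG_ppp)  # returns separate bandwidths for x and y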

##  sigma.x  sigma.y 
## 2303.282 1492.870

The results are not the same because bw.diggle() returns a single isotropic bandwidth (a distance), while bw.scott() returns separate bandwidths along the x and y axes (an extent in each direction).

## Warning: point-in-polygon test had difficulty with 215 points (total score not 0
## or 1)

The plot() function of Base R is then used to display the kernel density derived.

NOTE: The values found in kde_childcareSG_bw are very small because the unit of the SVY21 projection is meters, so the values are in units of points/m^2.

In the code chunk below, rescale() is used to convert the unit of measurement from meters to kilometers.
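A sketch of the rescaling; rescale() takes the scale factor and the new unit name (object names assumed):

    childcareSG_ppp.km <- rescale(childcareSG_ppp, 1000, "km")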

Now, we can re-run density() using the rescaled data set and plot the output KDE map.

## Warning: point-in-polygon test had difficulty with 215 points (total score not 0
## or 1)

We do not have to rescale() in every situation; check the units of the intensity before deciding to convert. For example, we do this for Singapore SVY21 data because the units are originally in points/m^2.

3.2 Converting KDE output into grid object

The result is the same; we simply convert it into a form suitable for mapping purposes (a gridded layer that tmap can understand).

We need a raster because it allows us to explicitly provide a CRS; without one, the layers cannot be mapped properly.
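A sketch of the conversion, assuming maptools’ as.SpatialGridDataFrame.im() as in the hands-on exercise (object names assumed):

    gridded_kde_childcareSG_bw <- as.SpatialGridDataFrame.im(kde_childcareSG_bw)
    spplot(gridded_kde_childcareSG_bw)  # quick check of the gridded layer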

3.3 Converting gridded output into raster

We convert the gridded kernel density object into a RasterLayer object by using raster() of the raster package.
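A sketch (object names assumed):

    kde_childcareSG_bw_raster <- raster(gridded_kde_childcareSG_bw)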

Let us take a look at the properties of kde_childcareSG_bw_raster RasterLayer.

## class      : RasterLayer 
## dimensions : 128, 128, 16384  (nrow, ncol, ncell)
## resolution : 0.4170614, 0.2647348  (x, y)
## extent     : 2.663926, 56.04779, 16.35798, 50.24403  (xmin, xmax, ymin, ymax)
## crs        : NA 
## source     : memory
## names      : v 
## values     : -4.887525e-15, 26.55366  (min, max)

Notice that the crs property is NA.

3.4 Assigning projection systems

The code chunk below will be used to include the CRS information in the kde_childcareSG_bw_raster RasterLayer.
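A sketch of assigning SVY21 (EPSG:3414), which also triggers the datum warning shown below (object name assumed):

    projection(kde_childcareSG_bw_raster) <- CRS("+init=EPSG:3414")
    kde_childcareSG_bw_raster  # crs should now be populated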

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum SVY21 in CRS definition
## class      : RasterLayer 
## dimensions : 128, 128, 16384  (nrow, ncol, ncell)
## resolution : 0.4170614, 0.2647348  (x, y)
## extent     : 2.663926, 56.04779, 16.35798, 50.24403  (xmin, xmax, ymin, ymax)
## crs        : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +units=m +no_defs 
## source     : memory
## names      : v 
## values     : -4.887525e-15, 26.55366  (min, max)

The CRS is now assigned.

4 Lesson 05

We need to define our bandwidth more objectively. One way to do this is to use second-order spatial point pattern analysis to understand when data points start to form clusters, or when data points show spatial patterns that are statistically significant.

Although there are many functions, the most commonly used is the L-function, a transformed version of Ripley’s K-function. The L-function is preferable because the theoretical benchmark in a Ripley’s K-function plot is a diagonal line (shown in red); by further subtracting the distance, the benchmark becomes a horizontal line, which is easier to interpret.
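For reference, the standard transformation is

$$L(r) = \sqrt{\frac{K(r)}{\pi}}$$

so that under complete spatial randomness L(r) = r; plotting L(r) - r turns the benchmark into the horizontal line at zero.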

A Monte Carlo simulation is used to simulate a distribution multiple times using the envelope() function.

In order to ensure that the results are consistent, we run multiple simulations and combine them. The number of simulations depends on the confidence level and is calculated as 2/significance level: a confidence level of 95% requires at least 2/0.05 = 40 simulations, and a confidence level of 90% requires at least 2/0.1 = 20 simulations.

The presence of water catchment areas may skew the calculations, so for the hands-on exercise we limit the analysis by planning subzone area.

One highlight of the hands-on exercise is the removal of duplicates. For second-order spatial point analysis, we cannot have two point events sharing the same location; when we run the simulations, we may get strange results. We use rjitter() to slightly displace such points so that they no longer share the same coordinate values. Shifting positions by a metre or so has negligible real-world implications, but it helps the analysis when duplicates are involved.

There are multiple estimates, and they work the same way.

4.1 Choa Chu Kang Planning Area

4.1.1 Computing L-function Estimation

Statistics always involves uncertainty, so we need to perform hypothesis testing to confirm our observations. Normally, we define the hypothesis and run the test. But here we would be comparing a single empirical curve against a single theoretical curve, which is not enough. This is why we need to perform a Complete Spatial Randomness (CSR) test.
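A sketch of the L-function estimation for Choa Chu Kang (object name assumed):

    L_ck <- Lest(childcare_ck_ppp, correction = "Ripley")
    plot(L_ck, . - r ~ r, ylab = "L(d)-r", xlab = "d(m)")  # plot L(r) - r against r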

4.1.2 Performing Complete Spatial Randomness Test

NOTE: In a CSR test, the null hypothesis is always random distribution, and the alternative hypothesis is that the distribution is not random. This makes it a two-tailed hypothesis test, because if the pattern is not completely spatially random, it can be either clustered or evenly distributed.

To confirm the observed spatial patterns above, a hypothesis test will be conducted. The hypothesis and test are as follows:

Ho = The distribution of childcare services at Choa Chu Kang is randomly distributed.

H1 = The distribution of childcare services at Choa Chu Kang is not randomly distributed.

The null hypothesis will be rejected if the p-value is smaller than the alpha value of 0.001.

Should we run 99 simulations for an alpha value of 0.001? Based on what we discussed earlier, we should use 2/0.001 = 2000 simulations; 99 simulations are not enough.

The code chunk below is used for hypothesis testing:
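A sketch of the CSR test matching the 1999 simulations generated below (object name assumed; nrank = 1 takes the most extreme simulated values as the envelope):

    L_ck.csr <- envelope(childcare_ck_ppp, Lest,
                         nsim = 1999,   # 1999 simulations + observed pattern = 2000
                         nrank = 1,
                         global = TRUE)
    plot(L_ck.csr, . - r ~ r)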

## Generating 1999 simulations of CSR  ...
## 1, 2, 3, ... [progress messages for the 1999 simulations truncated] ... 1999.
## 
## Done.

Why nsim = 1999? Because the observed pattern is counted together with the simulations: 1999 simulations plus the observed pattern give 2000 patterns in total (in other words, the counting effectively starts from 0 and not from 1).

Running 2000 simulations is quite slow. The envelope() function gets its name because the plotted band envelopes the results of these 2000 simulations.

We then draw conclusions based on this envelope.

If we look at Tampines, we cannot reject the null hypothesis within a certain distance (130 m). Outside of this distance, we can reject the null hypothesis and determine whether the pattern is clustered or regularly distributed. In this case, since the empirical line lies above the envelope, it is a sign of clustering.

This means that for distances above 130 m, there are signs of clusters forming.

4.2 Determining Kernel Bandwidth

Based on this, we can decide to use 130 m as our kernel bandwidth. This is because we use the point where the empirical line cuts the envelope, as this is where statistical significance begins. We then determine whether this point is below or above the theoretical line to determine whether there is clustering or a regular distribution.