In this in-class exercise, three R packages will be used:
packages = c('rgdal', 'maptools', 'raster','spatstat', 'tmap')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}## Warning: package 'maptools' was built under R version 4.0.2
## Warning: package 'spatstat' was built under R version 4.0.2
## Warning: package 'spatstat.data' was built under R version 4.0.2
## Warning: package 'nlme' was built under R version 4.0.2
## Warning: package 'tmap' was built under R version 4.0.2
In this section, readOGR() of rgdal package will be used to import the three geospatial data in R’s SpatialPolygonsDataFrame or SpatialPointsDataFrame.These data are not in simple features, but rather, they are in Spatial classes.
## OGR data source with driver: ESRI Shapefile
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "CHILDCARE"
## with 1312 features
## It has 18 fields
## OGR data source with driver: ESRI Shapefile
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "CostalOutline"
## with 60 features
## It has 4 fields
## OGR data source with driver: ESRI Shapefile
## Source: "D:\IS415-AY2020-21T1\06-Submission\In-class Exercise\In-class_Ex05\In-class_Ex05_CHEONG WEI HERNG EUGENE\In Class Exercise 05\data", layer: "MP14_SUBZONE_WEB_PL"
## with 323 features
## It has 15 fields
We can plot these geospatial data in one plot by using code chunk below.
We can also prepare an interactive pin map by using the code chunk below.
## tmap mode set to plotting
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
spatstat requires the analytical data in ppp object form. There is no direct way to convert a SpatialDataFrame into ppp object. We need to convert the SpatialDataFrame into Spatial object first.
The codes below will convert the SpatialPoint and SpatialPolygon data frame into generic spatialpoints and spatialpolygons objects.
What are the differences between SpatialPoints object and SpatialPointDataFrame object?
We see that the childcare_sp does not have the dataframe as found in the SpatialPointDataFrame object. This is because the dataframe is not required. We are only interested in the X/Y coordinates of the SpatialPoints.
Now, we will use as.ppp() function of spatstat to convert the spatial data into spatstat’s ppp object format.
## Planar point pattern: 1312 points
## window: rectangle = [11203.01, 45404.24] x [25667.6, 49300.88] units
Now, let us plot childcare_ppp and examine the difference.
The code chunk below implements the jittering approach.
NOTE: The previous parts of the exercise are skipped.
To create owin, the input must be in Poylgon form.
The code chunk below is used to covert sg SpatialPolygon object into owin object of spatstat.
The ouput object can be displayed by using plot() function.
By using the code below, we are able to extract childcare that is within the specific region to do our analysis later on.
Here we plot the combined childcare point and Punggol region to prove that it works.
NOTE: Instead of the points being bounded by a rectangle, we now bound the study area with the area of Singapore.
## sigma
## 290.1203
Prof’s exercise shows sigma of 298.4095. This could be because some parts are skipped in this notebook.
## sigma.x sigma.y
## 2303.282 1492.870
The results are not the same because bw.diggle() gives the distance, but bw.scott() returns the extent.
## Warning: point-in-polygon test had difficulty with 215 points (total score not 0
## or 1)
The plot() function of Base R is then used to display the kernel density derived.
NOTE:The values found in kde_childcareSG_bw in very small because the units for SVY projection is in meters, so the value are in units of points/m^2.
In the code chunk below, rescale() is used to covert the unit of measurement from meter to kilometer.
Now, we can re-run density() using the resale data set and plot the output kde map.
## Warning: point-in-polygon test had difficulty with 215 points (total score not 0
## or 1)
We do not have to rescale() in every situation. Check the units for intensity before deciding to convert. For example, we do this for Singapore SVY21 because units are originally in point/m^2.
The result is the same, we just convert it so that it is suitable for mapping purposes (a raster layer for tmap to understand).
gridded_kde_childcareSG_bw <- as.SpatialGridDataFrame.im(kde_childcareSG.bw)
spplot(gridded_kde_childcareSG_bw)We need a raster because we can explicitly provide a CRS and we cannot map them properly otherwise.
We convert the gridded kernel density objects into RasterLayer object by using raster() of raster package.
Let us take a look at the properties of kde_childcareSG_bw_raster RasterLayer.
## class : RasterLayer
## dimensions : 128, 128, 16384 (nrow, ncol, ncell)
## resolution : 0.4170614, 0.2647348 (x, y)
## extent : 2.663926, 56.04779, 16.35798, 50.24403 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : v
## values : -4.887525e-15, 26.55366 (min, max)
Notice that the crs property is NA.
The code chunk below will be used to include the CRS information on kde_childcareSG_bw_raster RasterLayer.
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum SVY21 in CRS definition
## class : RasterLayer
## dimensions : 128, 128, 16384 (nrow, ncol, ncell)
## resolution : 0.4170614, 0.2647348 (x, y)
## extent : 2.663926, 56.04779, 16.35798, 50.24403 (xmin, xmax, ymin, ymax)
## crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +units=m +no_defs
## source : memory
## names : v
## values : -4.887525e-15, 26.55366 (min, max)
The CRS is now assigned.
We need to define our bandwidth more objectively. One way to do this is to use 2nd order spatial points pattern analysis to understand when data points start to form clusters or when data points have spatial patterns which are statistically significant.
Although there are many functions, the most commonly used function is the L-function which is a transformed version of Ripley’s K-function. L-function is good because there is a red diagonal line in Ripley’s K-function which can be further transformed by deducting the distance for easier interpretation.
A Monte Carlo simulation is used to simulate a distribution multiple times using the envelope() function.
In order to ensure that the results are consistent, we run multiple simulations and combine. Number of simulations depend on confidence level is calculated by 2/significance-level. So a confidence level of 95% requires at least 2/0.05 = 40 simulations. A confidence level of 90% requires at least 2/0.1 = 20 simulations.
The presence of water catchment areas may skew the calculations, so for the hands on exercise, we limit by planning subzone area.
One highlight of the hands-on exercise is to remove duplicates. For second order spatial point analysis, we cannot have 2 points which represent 2 point events. When we run the simulations, we may have weird results. We use rjitter to slightly displace them so that they do not share the same co-ordinate values. There is negligible real-world implications of jittering if the positions are shifted by 1 metre, but it helps during the analysis when duplicates are involved.
There are multiple estimates, and they work the same way.
ck = mpsz[mpsz@data$PLN_AREA_N == "CHOA CHU KANG",]
childcare_sp = as(childcare, "SpatialPoints")
ck_sp = as(ck, "SpatialPolygons")
ck_owin = as(ck_sp, "owin")
childcare_ck_ppp = childcare_ppp_jit[ck_owin]
L_ck = Lest(childcare_ck_ppp, correction = "Ripley")
plot(L_ck, . -r ~ r,
ylab= "L(d)-r", xlab = "d(m)")There is a certainty problem in statistics, so we need to perform hypothesis testing and confirm our hypothesis. Normally, we define the hypothesis and run the test. But we are testing with one empirical data and theoretical data, which is not enough. This is why we need to perform Complete Spatial Randomness test.
NOTE: In CSR test, the null hypothesis is always random distribution, and the alternative hypothesis is that the distribution is not random. This means that this is a two-tail hypothesis test. This is because if it is not completely spatially random, it can be clustered or evenly distributed.
To confirm the observed spatial patterns above, a hypothesis test will be conducted. The hypothesis and test are as follows:
Ho = The distribution of childcare services at Choa Chu Kang are randomly distributed.
H1= The distribution of childcare services at Choa Chu Kang are not randomly distributed.
The null hypothesis will be rejected if p-value is smaller than alpha value of 0.001.
Should we run 99 simulations for an alpha-value of 0.001? Based on what we discussed earlier, we should use 2/0.001 = 2000 simulations. 99 simulations are not enough.
The code chunk below is used for hypothesis testing:
## Generating 1999 simulations of CSR ...
## 1, 2, 3, ................20 [etd 2:46] ...................40 [etd 2:49] ..
## .................60 [etd 2:46] ...................80 [etd 2:52] ....
## ...............100 [etd 3:01] ...................120 [etd 3:11] ......
## .............140 [etd 3:08] ...................160 [etd 3:09] ........
## ...........180 [etd 3:04] ...................200 [etd 2:59] ..........
## .........220 [etd 2:55] ...................240 [etd 2:51] ............
## .......260 [etd 2:48] ...................280 [etd 2:44] ..............
## .....300 [etd 2:42] ...................320 [etd 2:39] ................
## ...340 [etd 2:37] ...................360 [etd 2:34] ..................
## .380 [etd 2:32] ...................400 [etd 2:29] ...................420
## [etd 2:27] ...................440 [etd 2:24] ...................460 [etd 2:22] ..
## .................480 [etd 2:20] ...................500 [etd 2:18] ....
## ...............520 [etd 2:16] ...................540 [etd 2:14] ......
## .............560 [etd 2:12] ...................580 [etd 2:09] ........
## ...........600 [etd 2:07] ...................620 [etd 2:05] ..........
## .........640 [etd 2:03] ...................660 [etd 2:01] ............
## .......680 [etd 1:59] ...................700 [etd 1:57] ..............
## .....720 [etd 1:55] ...................740 [etd 1:53] ................
## ...760 [etd 1:51] ...................780 [etd 1:49] ..................
## .800 [etd 1:48] ...................820 [etd 1:46] ...................840
## [etd 1:44] ...................860 [etd 1:42] ...................880 [etd 1:40] ..
## .................900 [etd 1:38] ...................920 [etd 1:36] ....
## ...............940 [etd 1:34] ...................960 [etd 1:32] ......
## .............980 [etd 1:31] ...................1000 [etd 1:29] ........
## ...........1020 [etd 1:27] ...................1040 [etd 1:25] ..........
## .........1060 [etd 1:23] ...................1080 [etd 1:21] ............
## .......1100 [etd 1:20] ...................1120 [etd 1:18] ..............
## .....1140 [etd 1:16] ...................1160 [etd 1:14] ................
## ...1180 [etd 1:12] ...................1200 [etd 1:11] ..................
## .1220 [etd 1:09] ...................1240 [etd 1:07] ...................1260
## [etd 1:05] ...................1280 [etd 1:04] ...................1300 [etd 1:02] ..
## .................1320 [etd 1:00] ...................1340 [etd 58 sec] ....
## ...............1360 [etd 57 sec] ...................1380 [etd 55 sec] ......
## .............1400 [etd 53 sec] ...................1420 [etd 51 sec] ........
## ...........1440 [etd 49 sec] ...................1460 [etd 48 sec] ..........
## .........1480 [etd 46 sec] ...................1500 [etd 44 sec] ............
## .......1520 [etd 42 sec] ...................1540 [etd 40 sec] ..............
## .....1560 [etd 39 sec] ...................1580 [etd 37 sec] ................
## ...1600 [etd 35 sec] ...................1620 [etd 33 sec] ..................
## .1640 [etd 32 sec] ...................1660 [etd 30 sec] ...................1680
## [etd 28 sec] ...................1700 [etd 26 sec] ...................1720 [etd 25 sec] ..
## .................1740 [etd 23 sec] ...................1760 [etd 21 sec] ....
## ...............1780 [etd 19 sec] ...................1800 [etd 18 sec] ......
## .............1820 [etd 16 sec] ...................1840 [etd 14 sec] ........
## ...........1860 [etd 12 sec] ...................1880 [etd 11 sec] ..........
## .........1900 [etd 9 sec] ...................1920 [etd 7 sec] ............
## .......1940 [etd 5 sec] ...................1960 [etd 3 sec] ..............
## .....1980 [etd 2 sec] .................. 1999.
##
## Done.
Why nsim = 1999? Because for nsim, R starts counting from 0 and not from 1.
Running 2000 simulations would be quite slow. The envelope() has his name because the graph envelopes the results of these 2000 simulations.
We then draw conclusions based on this envelope.
If we look at Tampines, we cannot reject the null hypothesis within a certain distance (130 m). Outside of this distance, we can reject the null hypothesis and determine if it is clustered or regularly distributed. In this case, since it is above the empirical line, it is a sign of clustering.
This means that for distances above 130 m, there are signs of clusters forming.
Based on this, we can decide to use 130 m as our kernel bandwidth. This is because we use the point where the empirical cuts the envelope, as this point is where there is now statistical significance. We then determine if this point is below of above the theoretical line to determine if there is clustering or regular distribution.