A wide variety of phenomena can be modeled as points. These include physically static features like building locations, as well as dynamic data like household survey responses or water utility service points.
There are a variety of techniques for analyzing point data that provide not only useful visualizations, but also insights that can inform action and programs. This GIS article will focus mainly on hotspot analysis techniques.
WATSAN baseline data is commonly made available as spreadsheet data, with the X_longitude and Y_latitude columns representing the locations of respondents. Although these data are often officially flagged as no longer useful for the project activities, they can be reused for other purposes like sanitation zoning and other fields of study.
Loading the data with read.csv(), exploring the column names, and listing the first row gives some sense of what is available. The as.is=T parameter prevents strings from being converted to factors, which would make them difficult to compare between different tables or objects.
The data can be converted to a spatial object using the latitude and longitude columns. These coordinates are in WGS84 longitude/latitude and will later be projected to a metric coordinate system, Universal Transverse Mercator (UTM) zone 18.
library(sp)
setwd("c:/users/dpierre/documents")
baseline <- read.csv("watsan_baseline_data.csv", as.is=T)
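# Explore what is available: column names and the first row
# (the column names shown by names() depend on the survey instrument)
names(baseline)
baseline[1, ]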
# Remove households with no lat/long
baseline <- baseline[!is.na(baseline$X_latitude),]
baseline <- baseline[!is.na(baseline$X_longitude),]
# Convert to a SpatialPointsDataFrame in WGS84 longitude/latitude
wgs84 = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
utm = CRS('+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0')
Sp_baseline = SpatialPointsDataFrame(baseline[,c('X_longitude', 'X_latitude')], baseline, proj4string = wgs84)
plot(Sp_baseline)
The plot shows most of the points clustered in a small area at the top of the chart, with an aberrant point at the bottom left, probably at (0, 0). This indicates the presence of incorrectly geocoded points that must be removed.
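As a quick first pass, such points can be dropped directly. This is a minimal sketch that assumes the aberrant points sit at exactly zero coordinates; the neighborhood overlay described next removes stray points more rigorously.
# Drop points geocoded exactly at (0, 0); assumes bad records have zero coordinates
bad = coordinates(Sp_baseline)[,1] == 0 & coordinates(Sp_baseline)[,2] == 0
Sp_baseline = Sp_baseline[!bad, ]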
The WATSAN spatial database provides a spatial layer of neighborhoods that can be used both for aggregation and to remove the incorrectly geocoded points.
# Connect to the PostgreSQL database
library(RPostgreSQL)
options(pgd = list(
"host" = "postgresql-sdlo.alwaysdata.net",
"port" = 5432,
"user" = "sdlo_reader",
"password" = "sdloreader",
"dbname" = "sdlo_db",
"drv" = dbDriver("PostgreSQL")
))
db <- dbConnect(drv = options()$pgd$drv, dbname = options()$pgd$dbname, host = options()$pgd$host,
port = options()$pgd$port, user = options()$pgd$user,
password = options()$pgd$password)
# Construct the fetch queries for the neighborhood geometries and attributes
query1 <- "SELECT ST_AsText(geom_neighborhood) FROM tbl_neighborhood WHERE cte = 'ctecaphaitien'"
query2 <- "SELECT neighborhood_id, name, sup_km2 FROM tbl_neighborhood WHERE cte = 'ctecaphaitien'"
# Submit the queries and disconnect
polyg <- suppressWarnings(dbGetQuery(db, query1))
dt <- suppressWarnings(dbGetQuery(db, query2))
dbDisconnect(db)
## [1] TRUE
# Parse each WKT geometry string and attach the matching attribute row
library(rgeos)
spdf <- lapply(seq_len(nrow(polyg)), function(i) {
  SpatialPolygonsDataFrame(readWKT(polyg[i, ]), dt[i, ], match.ID = FALSE)
})
neighborhoods <- do.call("rbind", spdf)
proj4string(neighborhoods) <- utm
plot(neighborhoods, col=terrain.colors(nrow(neighborhoods)))
The over() overlay function can be used to merge the baseline data points with the neighborhood names, and to filter out points that do not fall in any neighborhood:
Sp_baseline <- spTransform(Sp_baseline, utm)
Sp_baseline$NBHD_NAME = over(Sp_baseline, neighborhoods)$name
Sp_baseline = Sp_baseline[!is.na(Sp_baseline$NBHD_NAME),]
plot(Sp_baseline, col='blue')
plot(neighborhoods, border='black', add=T)
A common and quick way of analyzing potable water quality patterns is to aggregate the mean ranking within areas like neighborhoods:
neighborhoods$rank_quality_potable = over(neighborhoods, Sp_baseline, fn = mean)$rank_quality_potable
library(lattice)  # trellis.par.set() controls the lattice settings used by spplot()
library(grid)     # unit() and grid.text() are used for the color key label below
trellis.par.set(add.line = list(col = "transparent"))
scale = list("SpatialPolygonsRescale", layout.scale.bar(),
offset = c(795000,2180000), scale = 2000, fill=c("transparent","black"), which=1)
text1 = list("sp.text", c(795000,2179700), "0",which=1, cex = .7)
text2 = list("sp.text", c(797500,2179700), "1000 m",which=1, cex = .7)
arrow = list("SpatialPolygonsRescale", layout.north.arrow(type=1),
offset = c(785000, 2188000), scale = 1000, which=1)
# plotting with scale bar, north arrow, and Title
spplot(neighborhoods, zcol='rank_quality_potable',
colorkey=list(space="right", align="bottom"),
scales = list(draw = TRUE), main = "Rank mean of water quality by neighborhood",
col.regions = colorRampPalette(c("yellow","orange","purple","blue","darkblue"))(16),
ylab = "Northing (m)", xlab = "Easting (m)",
sp.layout = list(scale,text1,text2,arrow),
as.table = TRUE
)
grid.text("rank mean", x=unit(0.90, "npc"), y=unit(0.50, "npc"), rot=-90, check.overlap = TRUE)
A major issue with any kind of aggregation by area is that the results of your analysis can vary widely depending on where you draw the boundaries. This is the Modifiable Areal Unit Problem (MAUP), and it can be illustrated by the different patterns displayed when household baseline survey responses are aggregated by neighborhood rather than by building. Compensating for the MAUP requires techniques like the ones described below.
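The boundary sensitivity can also be demonstrated directly. The sketch below borrows the raster workflow introduced in the next section to aggregate the same mean ranking at two arbitrary cell sizes (500 m and 1000 m); the apparent pattern shifts noticeably between the two maps.
library(raster)
# Aggregate mean rank at two different (arbitrary) cell sizes and compare
for (size in c(500, 1000)) {
  box = round(extent(neighborhoods) / size) * size
  template = raster(box, crs = utm,
                    nrows = (box@ymax - box@ymin) / size,
                    ncols = (box@xmax - box@xmin) / size)
  r = rasterize(Sp_baseline, template, field = 'rank_quality_potable', fun = mean)
  plot(r, main = paste(size, "m cells"))
}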
One technique that avoids the MAUP is kernel density estimation (KDE), a statistical technique that, when applied to spatial data, is commonly used to create heat maps.
KDE involves systematically running a small kernel matrix over the area being analyzed to visually spread the effect of potable water points over adjacent space. On a point map, locations where large numbers of potable water points occur close to each other would not be clear, but using a kernel to spread that effect and color it more intensely makes hot spots stand out.
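To make the kernel idea concrete, here is a minimal hand-built sketch (not the exact matrix that focalWeight() produces later): each cell is weighted in proportion to exp(-r^2 / (2 * sigma^2)), where r is its distance from the kernel center, and the weights are normalized to sum to one.
# Build a small illustrative 5 x 5 Gaussian kernel; sigma = 1 cell is an arbitrary choice
sigma = 1
offsets = -2:2
k = outer(offsets, offsets, function(dx, dy) exp(-(dx^2 + dy^2) / (2 * sigma^2)))
round(k / sum(k), 3)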
To create a heat map, you start by converting the points to a raster, or regular grid of areas or pixels, using the rasterize() function from the raster library. In this case, since our coordinate system is in meters, a pixel size of 100 means each grid cell is 100 x 100 meters. This choice of size is arbitrary, but it should not be so large that the pixels become areas big enough to run into the MAUP.
Because of this small pixel size, the individual potable water point values in each pixel are not visually obvious when the raster is visualized.
library(raster)
pixelsize = 100
box = round(extent(neighborhoods) / pixelsize) * pixelsize
template = raster(box, crs = utm,
nrows = (box@ymax - box@ymin) / pixelsize,
ncols = (box@xmax - box@xmin) / pixelsize)
raster2015 = rasterize(Sp_baseline, template, field = 'rank_quality_potable', fun = mean)
plot(raster2015, xlim=c(787000, 798000), ylim=c(2181000, 2189000))
plot(neighborhoods, border='#00000040', add=T)
Next, the focal() function is used to run a Gaussian smoothing kernel, created with the focalWeight() function, over the raster of potable water point values to create a new heat map raster. The heat map clearly shows hot spots of potable water access in a variety of areas, notably the south-east area where the F8, F10, and F11 DINEPA water pumping stations are located. Because the values of these water points are smoothed over adjacent areas, the intensity of areas relative to other areas on the map is more meaningful than the absolute numeric values in specific pixels.
kernel = focalWeight(raster2015, d = 250, type = 'Gauss')
heat2015 = focal(raster2015, kernel, fun = sum, na.rm=T)
plot(heat2015, xlim=c(787000, 798000), ylim=c(2181000, 2189000))
plot(neighborhoods, border='#00000040', add=T)
In a manner similar to the use of contour lines on a topographic elevation map, contour lines can be created from heat maps to circle hot spots where the level of water access exceeds a certain threshold.
The choice of threshold is made by trial and error. Since the absolute value of individual pixel numbers is largely meaningless, there is no single correct threshold, and choosing an appropriate number requires some knowledge of specific neighborhood conditions to judge whether it is too low (including areas that are not really hot spots) or too high (missing important areas of water access concentration). This is a problem similar to the MAUP, in that arbitrary or historical choices can significantly alter the results of your analysis.
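One way to make the trial and error more systematic is to tabulate how many pixels each candidate threshold would classify as hot. This is a minimal sketch with arbitrary candidate values; the final choice would still need to be checked against local knowledge.
# Count the pixels at or above each candidate threshold
for (t in c(0.4, 0.6, 0.8, 1.0)) {
  cat("threshold", t, ":", sum(values(heat2015) >= t, na.rm = TRUE), "pixels\n")
}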
This code uses the rasterToPolygons() function to create polygons for each pixel where the value exceeds the threshold. The gUnaryUnion() function from the rgeos library is then used to dissolve the lines between adjacent pixel polygons and form outlines. The gBuffer() function is then used to expand and smooth the outlines so they are easier to read.
library(rgeos)
threshold = 0.8
polygons = rasterToPolygons(x = heat2015, n=16, fun = function(x) { x >= threshold })
contours2015 = gBuffer(gUnaryUnion(polygons), width=100)
plot(heat2015, xlim=c(787000, 798000), ylim=c(2181000, 2189000))
plot(contours2015, border='blue', add=T)
One major issue with kernel density estimation is that the choice of a threshold for what is and is not a hot spot is arbitrary, making use of KDE imprecise and subject to misinterpretation.
One commonly used technique to work around that issue was developed by Arthur Getis and J.K. Ord in the early 1990s. When you develop an algorithm, you get to name it after yourself, so this is called the Getis-Ord Gi* statistic.
This statistic is based on the observation that the differences in potable water access intensity between neighboring areas across a map will follow a normal curve. The amount of difference between an area and its neighbors can therefore be converted to a z-score reflecting the number of standard deviations (σ) by which the potable water access level in an area differs from its neighborhood mean. Areas with high z-scores (indicating potable water access significantly above the mean) are identified as hot spots, while areas with low z-scores are identified as cold spots.
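The z-score cutoffs used in the classification below follow from the normal curve; in base R, pnorm() converts a z-score to a cumulative probability, so the familiar 1.96 corresponds to roughly 95% two-sided confidence.
# Two-sided confidence levels associated with common z-score cutoffs
z = c(1, 1.96, 2.58)
round(2 * pnorm(z) - 1, 3)   # ~0.683, ~0.950, ~0.990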
# Create a regular grid of 300 x 300 meter potable water access areas via a raster
pixelsize = 300
box = round(extent(neighborhoods) / pixelsize) * pixelsize
template = raster(box, crs = utm,
nrows = (box@ymax - box@ymin) / pixelsize,
ncols = (box@xmax - box@xmin) / pixelsize)
getisraster = rasterize(Sp_baseline, template, field = 'rank_quality_potable', fun = mean)
getisgrid = rasterToPolygons(getisraster)
# Create the list of neighbors
library(spdep)
neighbors = poly2nb(getisgrid)
weighted_neighbors = nb2listw(neighbors, zero.policy = T)
# Perform the local G analysis (Getis-Ord GI*)
getisgrid$HOTSPOT = as.vector(localG(getisgrid$layer, weighted_neighbors))
# Color the grid cells based on the z-score
# Breaks at +/-1 and +/-1.96 correspond to 68% and 95% two-sided confidence
breaks = c(-20, -1.96, -1, 1, 1.96, 20)
palette=c("#0000FF80", "#8080FF80", "#FFFFFF80", "#FF808080", "#FF000080")
col = palette[cut(getisgrid$HOTSPOT, breaks)]
# Plot
plot(getisgrid, col=col, border=NA, xlim=c(787000, 798000), ylim=c(2181000, 2189000))
plot(neighborhoods, border="gray", add=T)
legend(x='topleft', inset=0.01, bg = 'white', col=palette, pch=15, cex = .8,
legend = c('Cold Spot >95% Confidence', 'Cold Spot 68-95% Confidence',
'Not Significant', 'Hot Spot 68-95% Confidence',
'Hot Spot >95% Confidence'))