Spatial Point Patterns

The data for this lab are on the dropbox folder. There is a file called north_derby.zip. Unzip it, and there will be four shapefiles inside.

library(rgdal)
data.dir <- "/Users/nicholasnagle/Dropbox/classes/geog_515_s12/Data/north_derby"
spasthma <- readOGR(dsn = data.dir, "spasthma")  # asthma point data

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/nicholasnagle/Dropbox/classes/geog_515_s12/Data/north_derby", layer: "spasthma"
## with 1291 features and 9 fields
## Feature type: wkbPoint with 2 dimensions

spbdry <- readOGR(dsn = data.dir, "spbdry")  # boundary layer

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/nicholasnagle/Dropbox/classes/geog_515_s12/Data/north_derby", layer: "spbdry"
## with 1 features and 1 fields
## Feature type: wkbPolygon with 2 dimensions

spsrc <- readOGR(dsn = data.dir, "spsrc")  # source layer

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/nicholasnagle/Dropbox/classes/geog_515_s12/Data/north_derby", layer: "spsrc"
## with 3 features and 1 fields
## Feature type: wkbPoint with 2 dimensions

sproads <- readOGR(dsn = data.dir, "sproads")  # road layer

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/nicholasnagle/Dropbox/classes/geog_515_s12/Data/north_derby", layer: "sproads"
## with 1 features and 1 fields
## Feature type: wkbMultiLineString with 2 dimensions

These data represent a random sample of children in the English town of North Derby. There are children with asthma (cases) and children without (control). In addition we have the location of motorways and of three large point sources of air pollutants.

# Plot the data in ggplot
library(ggplot2)
# Coerce all the data to data.frames for ggplot
ggbdry <- fortify(spbdry)

## Regions defined for each Polygons

ggasthma <- data.frame(long = coordinates(spasthma)[, 1], lat = coordinates(spasthma)[, 
    2], asthma = spasthma$Asthma)
ggsrc <- data.frame(long = coordinates(spsrc)[, 1], lat = coordinates(spsrc)[, 
    2], source = spsrc$Source)
ggroads <- fortify(sproads)

map <- ggplot(aes(x = long, y = lat), data = ggasthma)  #Initialize map with data
# Add points to map
map <- map + geom_point(aes(colour = asthma))
# Add boundary to map
map <- map + geom_path(aes(x = long, y = lat), data = ggbdry)
# Add roads to map, make them grey
map <- map + geom_path(aes(x = long, y = lat, group = group), data = ggroads, 
    colour = "grey50")
# Add point sources to map, make them red
map <- map + geom_point(aes(x = long, y = lat), data = ggsrc, colour = "red", 
    shape = 17, size = 10)
# And print the map to screen, making the axes equal along the way
map + coord_equal()

plot of chunk unnamed-chunk-2

So the big question is: are the cases clustered? But, clustered relative to what? Clustered geographically? Sure. But so is the population, as indicated by the controls. What we really want to know is are the cases clustered relative to population (i.e. the controls). Also, are the cases clustered around each other, or are they clustered around the point sources.

Analysis using spdep

spdep has a new data structure. Instead of SpatialPoints, it uses ppp objects, and insteady of SpatialPolygons, it uses owin objects. I find these objects awkward to work with, and it is a bit difficult to go back and forth between them. One advantage of the ppp is that a point pattern not only has the points, but it also has a boundary. A ppp has points and and owin inside it. This emphasizes that the boundary definition is a fundamental part of point pattern analysis.

library(spatstat)

## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.7-27. For overview type 'help("mgcv-package")'.
## Loading required package: deldir
## deldir 0.1-1
## 
## spatstat 1.33-0       (nickname: 'Titanic Deckchair') 
## For an introduction to spatstat, type 'beginner'

# Create the owin.  It is the more difficult.  We need to coordinates of the
# polygon.  coordinates(spbdry) That doesn't work. coordinates() returns the
# centroid, not the boundary.  A spatialPolygon is a list of polygons.  We
# only have one polygon though, so it is a list of length 1.  So I tried
# this:
owin(poly = spbdry@polygons[[1]]@Polygons[[1]]@coords)

## Error: Area of polygon is negative - maybe traversed in wrong direction?

# Ignore the error for now.  I got that inner part by looking at this and
# following the tree down to the coords
str(spbdry)

## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  1 obs. of  1 variable:
##   .. ..$ x: num 1
##   ..@ polygons   :List of 1
##   .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
##   .. .. .. ..@ Polygons :List of 1
##   .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. .. .. ..@ labpt  : num [1:2] 0.461 0.47
##   .. .. .. .. .. .. ..@ area   : num 0.395
##   .. .. .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. .. .. ..@ coords : num [1:22, 1:2] 0.827 0.513 0.57 0.949 0.976 ...
##   .. .. .. ..@ plotOrder: int 1
##   .. .. .. ..@ labpt    : num [1:2] 0.461 0.47
##   .. .. .. ..@ ID       : chr "0"
##   .. .. .. ..@ area     : num 0.395
##   ..@ plotOrder  : int 1
##   ..@ bbox       : num [1:2, 1:2] -0.0162 -0.0151 0.9757 0.868
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "x" "y"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
##   .. .. ..@ projargs: chr NA

# But when I got that error, I was swore a bit asked 'are you joking?  the
# poygons have to be specified in a specific direction?'  Well, we can
# reverse the order of the points
owin(poly = spbdry@polygons[[1]]@Polygons[[1]]@coords[seq(from = 22, to = 2, 
    by = -1), ])

## window: polygonal boundary
## enclosing rectangle: [-0.0162, 0.9757] x [-0.0151, 0.868] units

# That worked.  What a pain.
W <- owin(poly = spbdry@polygons[[1]]@Polygons[[1]]@coords[seq(from = 22, to = 2, 
    by = -1), ])
plot(W)

plot of chunk unnamed-chunk-3



# Now, create the ppp
xy <- coordinates(spasthma)
asthma.ppp <- ppp(x = xy[, 1], y = xy[, 2], window = W)

## Warning: data contain duplicated points

# Ignore the duplicates warning

plot(asthma.ppp)

plot of chunk unnamed-chunk-3


# Plot K function
plot(Kest(asthma.ppp))

plot of chunk unnamed-chunk-3

##        lty col    key            label
## iso      1   1    iso   hat(K)[iso](r)
## trans    2   2  trans hat(K)[trans](r)
## border   3   3 border  hat(K)[bord](r)
## theo     4   4   theo       K[pois](r)
##                                             meaning
## iso    Ripley isotropic correction estimate of K(r)
## trans        translation-corrected estimate of K(r)
## border            border-corrected estimate of K(r)
## theo                       theoretical Poisson K(r)

env <- envelope(Y = asthma.ppp, fun = Kest, nsim = 39)

## Generating 39 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
## 31, 32, 33, 34, 35, 36, 37, 38,  39.
## 
## Done.

plot(env)

plot of chunk unnamed-chunk-3

##      lty col  key      label
## obs    1   1  obs  K[obs](r)
## theo   2   2 theo K[theo](r)
## hi     1   8   hi   K[hi](r)
## lo     1   8   lo   K[lo](r)
##                                                meaning
## obs            observed value of K(r) for data pattern
## theo                 theoretical value of K(r) for CSR
## hi   upper pointwise envelope of K(r) from simulations
## lo   lower pointwise envelope of K(r) from simulations


# Note, the K function is a little difficult to visualize because of the
# dominant quadratic trend.  Remove that...  First, create a sequence of
# distances to estimate the K- function at:
r <- seq(0, 0.2, by = 0.01)
# The reason for specifying this directly, is we also need it in order to
# transform K in to L
env <- envelope(Y = asthma.ppp, fun = Kest, r = r, nsim = 29, transform = expression(sqrt(./pi) - 
    r))

## Generating 29 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
## 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,  29.
## 
## Done.

plot(env)

plot of chunk unnamed-chunk-3

##      lty col  key                   label
## obs    1   1  obs  sqrt(K/pi) - r[obs](r)
## theo   2   2 theo sqrt(K/pi) - r[theo](r)
## hi     1   8   hi   sqrt(K/pi) - r[hi](r)
## lo     1   8   lo   sqrt(K/pi) - r[lo](r)
##                                                             meaning
## obs            observed value of sqrt(K(r)/pi) - r for data pattern
## theo                 theoretical value of sqrt(K(r)/pi) - r for CSR
## hi   upper pointwise envelope of sqrt(K(r)/pi) - r from simulations
## lo   lower pointwise envelope of sqrt(K(r)/pi) - r from simulations

The population data are strongly clustered. But what's with the double peaks? Is there something multiscalar going on? Perhaps. But I think that the second peak is most likely due to the L-function reaching across the non-convex boundaries. There are holes, in which there are likely people, but they are outside of the study area.

K function with marked point pattern

marks(asthma.ppp) <- spasthma$Asthma
# How many point are there?
summary(asthma.ppp)

## Marked planar point pattern: 1291 points
## Average intensity 3270 points per square unit  
## 
## *Pattern contains duplicated points*
## 
## Coordinates are given to 8 decimal places
## 
## Multitype:
##         frequency proportion intensity
## case          215      0.167       544
## control      1080      0.833      2720
## 
## Window: polygonal boundary
## single connected closed polygon with 21 vertices
## enclosing rectangle: [-0.0162, 0.9757] x [-0.0151, 0.868] units 
## Window area =  0.395175 square units

# There are 1291 points, of which 215 are cases.  What is the K-function of
# the cases?
r <- seq(0, 0.2, by = 0.01)
K.dat <- Kest(asthma.ppp[marks(asthma.ppp) == "case", ], r = r)
# Create storage space for the simulations
K.storage <- matrix(NA, nrow = length(r), ncol = 40)
K.storage[, 1] <- K.dat$border
for (k in 2:40) {
    # randomly create 215 'cases'
    sim.case <- sample(1291, 215)
    K.storage[, k] <- Kest(asthma.ppp[sim.case, ], r = r)$border
}
library(plyr)  # I'll use the adply function to calcualte the quantile of each row
# Extract the 5th and 95th quantiles of each row to create a 90% confidence
# interval
K.env <- adply(K.storage, 1, quantile, c(0.05, 0.95), na.rm = TRUE)
plot(r, K.dat$border, type = "l")
lines(r, K.env[, 2], lty = 2)
lines(r, K.env[, 3], lty = 2)

plot of chunk unnamed-chunk-4

There is not really any significant evidence of clustering beyong that of the population. If you were to randomly pick 215 cases from the population, you could get a K-function just like the K function of the asthma cases.

Regression with marked point patterns

Each point has a mark: case or control. Are points more likely to be cases if they are near roads? Near point sources? This is a question for logistic regression.

asthma.df <- spasthma@data
asthma.df$x <- coordinates(spasthma)[, 1]
asthma.df$y <- coordinates(spasthma)[, 2]
asthma.df$smoking <- asthma.df$Nsmokers > 0
# There is some missing gender data... remove those
model <- glm(Asthma ~ d2source1 + d2source2 + d2source3 + smoking + HayFever + 
    Age + Gender + roaddist2, data = asthma.df, family = binomial, subset = Gender != 
    0)
summary(model)

## 
## Call:
## glm(formula = Asthma ~ d2source1 + d2source2 + d2source3 + smoking + 
##     HayFever + Age + Gender + roaddist2, family = binomial, data = asthma.df, 
##     subset = Gender != 0)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.390   0.435   0.527   0.610   1.205  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.08e-01   8.50e-01   -0.83    0.405    
## d2source1    1.16e+02   5.45e+01    2.13    0.033 *  
## d2source2   -1.94e+02   1.09e+02   -1.78    0.075 .  
## d2source3    1.03e+02   5.69e+01    1.82    0.069 .  
## smokingTRUE -1.88e-01   1.58e-01   -1.19    0.235    
## HayFever    -1.16e+00   1.86e-01   -6.24  4.5e-10 ***
## Age          8.12e-02   3.74e-02    2.17    0.030 *  
## Gender       3.67e-01   1.54e-01    2.38    0.017 *  
## roaddist2   -6.94e-08   9.24e-08   -0.75    0.453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1160.3  on 1283  degrees of freedom
## Residual deviance: 1105.9  on 1275  degrees of freedom
## AIC: 1124
## 
## Number of Fisher Scoring iterations: 4

There appears to be a weak effect due to the point sources. Interestingly, smoking seems to have no noticeable effect. It isn't even in the expected direction. (The moral is: smoke em if you got em, especially around kids. But I still wouldn't let you come near mine.)

What if we try to control for a residual spatial effect.

library(mgcv)
model <- gam(Asthma ~ d2source1 + d2source2 + d2source3 + smoking + HayFever + 
    Age + Gender + roaddist2 + s(x, y), data = asthma.df, family = binomial, 
    subset = Gender != 0)
summary(model)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Asthma ~ d2source1 + d2source2 + d2source3 + smoking + HayFever + 
##     Age + Gender + roaddist2 + s(x, y)
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.13e-01   4.29e-01    1.43    0.153    
## d2source1    0.00e+00   0.00e+00      NA       NA    
## d2source2    2.59e+01   2.79e+01    0.93    0.352    
## d2source3    0.00e+00   0.00e+00      NA       NA    
## smokingTRUE -1.84e-01   1.59e-01   -1.16    0.246    
## HayFever    -1.16e+00   1.86e-01   -6.26  3.9e-10 ***
## Age          7.89e-02   3.76e-02    2.10    0.036 *  
## Gender       3.64e-01   1.54e-01    2.36    0.018 *  
## roaddist2   -6.85e-08   9.34e-08   -0.73    0.463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df Chi.sq p-value  
## s(x,y) 2.56   3.05   6.33     0.1 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.041   Deviance explained =  4.8%
## UBRE score = -0.12485  Scale est. = 1         n = 1284

Not much difference.

The biggest thing that seems to be going on in these data is the effect of hayfever. When we did the K function analyis, we controlled for the population, but not for the distribution of hayfever. At this point in research, you might go back and change the simulation. Instead of randomly relabelling all of the data, you might randomly relabel the persons with hayfever, and randomly relabel the persons without hayfever, and then combine them before doing the simulation. In this way, not only would you be controlling for the population distribution, but you would also be controlling for the hayfever distribution.

It turns out to not make an effect though. There is no evidence of clustering in these data. At least not based on K functions. There is slight evidence based on the regressions, and that is clustering around the point sources, NOT clustering around each other.