Hypothesis testing

Disclaimer: The contents of this document come from Chapter 12, Hypothesis testing, and its Appendix in Intro to GIS and Spatial Analysis (Gimond, 2019). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.

This document is also published on RPubs.

Learning objectives

What we do today:
- Learn the concept of IRP/CSR
- Learn how to test IRP/CSR for a given point pattern
- Learn the tests for ANN and K/L functions

12.1 IRP/CSR

Two conditions for independent random process (IRP)/complete spatial randomness (CSR) point patterns:
1. Any event has equal probability of being in any location (no 1st order effect)
2. The location of one event is independent of the location of another event (no 2nd order effect)
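As a minimal sketch of these two conditions, assuming a unit-square study area (an illustrative choice, not the Massachusetts window used in the exercise), a CSR pattern can be simulated in base R by drawing each coordinate independently from a uniform distribution. The names n.pts, csr.x, and csr.y are hypothetical.

```r
set.seed(1)                 # reproducible illustration only
n.pts <- 100                # hypothetical number of events

# Condition 1: every location in the unit square is equally likely
csr.x <- runif(n.pts, min = 0, max = 1)
csr.y <- runif(n.pts, min = 0, max = 1)

# Condition 2: each draw is independent of all the others,
# so no event attracts or repels another
plot(csr.x, csr.y, pch = 16, asp = 1, xlab = "x", ylab = "y")
```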

12.2 Testing for CSR with the ANN tool

12.2.1 ArcGIS’ Average Nearest Neighbor Tool

Skip

12.2.2 A better approach: a Monte Carlo test

First, we postulate a process, which is our null hypothesis (H0). For example, we hypothesize that the distribution of Walmart stores is consistent with a completely random process (CSR).
Next, we simulate many realizations of our postulated process and compute a statistic (e.g. ANN) for each realization.
Finally, we compare our observed data to the patterns generated by our simulated processes and assess (via a measure of probability) if our pattern is a likely realization of the hypothesized process.
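The three steps can be sketched in base R as follows. This is only an illustration under stated assumptions: the study area is a unit square, the "observed" pattern is a made-up clustered pattern (not the Walmart data), and the helper ann() and all variable names are hypothetical.

```r
set.seed(42)

# Average nearest-neighbor distance for points given as coordinate vectors
ann <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))  # pairwise Euclidean distances
  diag(d) <- Inf                     # ignore each point's distance to itself
  mean(apply(d, 1, min))
}

# Hypothetical "observed" pattern: 50 points clustered around one center
obs.x <- pmin(pmax(rnorm(50, 0.5, 0.05), 0), 1)
obs.y <- pmin(pmax(rnorm(50, 0.5, 0.05), 0), 1)
ann.obs <- ann(obs.x, obs.y)

# Steps 1 and 2: simulate CSR realizations and compute ANN for each
n.sim   <- 199
ann.sim <- replicate(n.sim, ann(runif(50), runif(50)))

# Step 3: compare the observed statistic with the simulated distribution
hist(ann.sim, breaks = 30, main = NULL, xlim = range(ann.obs, ann.sim))
abline(v = ann.obs, col = "blue")
```

If the blue line falls far in the tail of the histogram, the observed pattern is an unlikely realization of the hypothesized process.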

Below: Three different outcomes from simulated patterns following a CSR point process.

Below: Histogram of simulated ANN values (from 1000 simulations)

12.2.2.1 Extracting a p-value from a Monte Carlo test

Skip: interested students are encouraged to check the chapter.

12.3 Alternatives to CSR/IRP

The assumption of CSR is a good starting point, but it’s often unrealistic.
Most real-world processes exhibit 1st and/or 2nd order effects.
Thus, we can simulate a point-generating process in relation to a possible factor (e.g., population density).
This represents an inhomogeneous point process, which is in fact CSR/IRP after controlling for the factor.
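One way to picture such a process is to let a covariate surface set the probability of placing a point in each grid cell. The sketch below assumes a unit-square study area and an invented covariate that increases from west to east; the grid, the density column, and all variable names are hypothetical and are not taken from the chapter.

```r
set.seed(7)

# Hypothetical covariate surface on a 20 x 20 grid over the unit square:
# density rises from west (x = 0) to east (x = 1)
n.cell  <- 20
centers <- (seq_len(n.cell) - 0.5) / n.cell
grid    <- expand.grid(x = centers, y = centers)
grid$density <- grid$x^2

# Pick grid cells with probability proportional to the covariate ...
n.pts <- 200
cell  <- sample(nrow(grid), size = n.pts, replace = TRUE, prob = grid$density)

# ... then place each point uniformly at random inside its cell
half  <- 0.5 / n.cell
sim.x <- grid$x[cell] + runif(n.pts, -half, half)
sim.y <- grid$y[cell] + runif(n.pts, -half, half)

plot(sim.x, sim.y, pch = 16, asp = 1, xlab = "x", ylab = "y")
```

Within any one cell the points are still placed at random, which is the sense in which the pattern is CSR/IRP once the covariate is accounted for.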

Below: Examples of two randomly generated point patterns using population density as the underlying process.

Below: Histogram of the distribution of ANN values, if population density affects Walmart locations

Below: Histogram of the distribution of ANN values, if income distribution affects Walmart locations

12.4 Monte Carlo test with K and L functions

For K/L functions, we compute the statistic at each distance band r for every simulated point pattern, from which we derive confidence intervals (e.g., 95%).
Plotting the confidence intervals for all r values produces saw-tooth-like envelopes.
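A rough base R sketch of how such an envelope can be built is given below. It assumes a unit-square study area and uses a naive K estimate without edge correction, which is a simplification for illustration and not the estimator used in the chapter; k.naive() and the other names are hypothetical.

```r
set.seed(3)

# Naive K-function estimate on the unit square (area = 1), with NO edge
# correction; this is a simplification for illustration only
k.naive <- function(x, y, r) {
  n <- length(x)
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf                                  # exclude self-pairs
  sapply(r, function(ri) sum(d <= ri) / (n * (n - 1)))
}

r     <- seq(0.02, 0.20, by = 0.02)   # distance bands
n.pts <- 50
n.sim <- 99

# One row per distance band, one column per simulated CSR pattern
k.sim <- replicate(n.sim, k.naive(runif(n.pts), runif(n.pts), r))

# Pointwise 95% envelope: 2.5th and 97.5th percentiles at each r
env.lo <- apply(k.sim, 1, quantile, probs = 0.025)
env.hi <- apply(k.sim, 1, quantile, probs = 0.975)

matplot(r, cbind(env.lo, env.hi), type = "l", lty = 2, col = "grey40",
        xlab = "r", ylab = "K(r)")
lines(r, pi * r^2, col = "red")       # theoretical K under CSR
```

An observed K curve that leaves the band between the two dashed lines at some r would be unusual under the simulated process at that distance.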

Below: Simulation results for the IRP/CSR hypothesized process

We can also simulate point patterns based on underlying factors and compute their confidence envelopes for the K/L functions.

Below: Simulation results for an inhomogeneous hypothesized process.

12.5 Testing for a covariate effect

How can we determine if a covariate explains point patterns well enough?
Estimate two models: a null model and an alternative model
- The null model has no covariate.
- The alternative model has a covariate.
After the estimation of the two models, conduct a likelihood ratio test.
- Check if its p-value is below a selected threshold (e.g., 0.05).
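For reference, the likelihood ratio statistic can be written as below. The notation is introduced here and is not taken from the chapter: L_0 and L_1 denote the maximized likelihoods of the null and alternative models, and k is the number of additional parameters in the alternative model (k = 1 for a single covariate).

```latex
\Lambda = 2\left[\log L_1 - \log L_0\right],
\qquad
\Lambda \sim \chi^2_k \quad \text{(approximately, if the null model is true)}
```

A large value of the statistic relative to the chi-squared distribution gives a small p-value, which favors the model with the covariate.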

In-Class Exercise

library(tidyverse)
library(rgdal)
library(maptools)
library(raster)
library(spatstat)

1. Test for clustering/dispersion

# load the data 
load(url("http://github.com/mgimond/Spatial/raw/master/Data/ppa.RData"))

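# drop the marks, then rescale from meters to kilometers
marks(starbucks)  <- NULL                        # keep point locations only
starbucks.km <- rescale(starbucks, 1000, "km")   # Starbucks point pattern
ma.km <- rescale(ma, 1000, "km")                 # Massachusetts boundary window
pop.km    <- rescale(pop, 1000, "km")            # population density raster
pop.lg <- log(pop)                               # logged population density
pop.lg.km <- rescale(pop.lg, 1000, "km")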

Compute the average distance to the nearest neighbor for all 171 Starbucks in Massachusetts.

ann.p <- mean(nndist(starbucks.km, k=1))
ann.p

## [1] 3.275492

Generate the distribution of expected ANN values given a homogeneous (CSR/IRP) point process using Monte Carlo methods.

n     <- 599L               # Number of simulations
ann.r <- vector(length = n) # Create an empty object to be used to store simulated ANN values

for (i in 1:n){
  rand.p   <- rpoint(n=starbucks.km$n, win=ma.km)  # Generate random point locations
  ann.r[i] <- mean(nndist(rand.p, k=1))  # Tally the ANN values
}

Plot the last set of random points.

plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5))

Compare the distribution of simulated ANN values with the ANN value computed from the actual Starbucks locations.
Interpretation: the stores are far more clustered than expected under the null.

hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r))
abline(v=ann.p, col="blue")

Run the same test but control for the influence due to population density distribution.
The population density raster pop.km should be used to define where a point should be most likely placed (high population density) and least likely placed (low population density) under this new null model.

n     <- 599L
ann.r <- vector(length=n)
for (i in 1:n){
  rand.p   <- rpoint(n=starbucks.km$n, f=pop.km) # f defines the probability density of the points
  ann.r[i] <- mean(nndist(rand.p, k=1))
}

Plot the last set of random points.
Note the cluster of points near the highly populated areas.
This pattern is different from the one generated from a completely random process.

Window(rand.p) <- ma.km  # Replace raster mask with ma.km window
plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5))

Compare the distribution of simulated ANN values (after controlling for population density) with the ANN value computed from the actual Starbucks locations.
Interpretation: the stores are still far more clustered than expected under the null.

hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r))
abline(v=ann.p, col="blue")

2. Computing a pseudo p-value from the simulation

N.greater <- sum(ann.r > ann.p)
p <- min(N.greater + 1, n + 1 - N.greater)/(n+1)
p

## [1] 0.001666667
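To see how this formula behaves, the sketch below wraps it in a hypothetical helper, pseudo_p(), and applies it to made-up simulated values that all exceed the observed ANN. These toy values are not the simulated values from the exercise; they only reproduce the situation in which the observed statistic lies below every simulated one.

```r
# Two-sided pseudo p-value: the observed value counts as one extra realization
pseudo_p <- function(sim, obs) {
  n         <- length(sim)
  n.greater <- sum(sim > obs)
  min(n.greater + 1, n + 1 - n.greater) / (n + 1)
}

# Toy values: 599 simulated statistics, all larger than the observed one
sim.toy <- seq(5, 6, length.out = 599)
p.toy   <- pseudo_p(sim.toy, obs = 3.275492)
p.toy   # 1/600, the smallest value attainable with 599 simulations
```

With n simulations the pseudo p-value can never fall below 1/(n + 1), so more simulations are needed to report a smaller p-value.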

3. Test for a Poisson point process model with a covariate effect

The ANN analysis addresses the 2nd order effect of a point process.
Here, we’ll address the 1st order effect using the Poisson point process model.
First, fit a model that assumes that the point process’ intensity is a function of the logged population density (the alternative hypothesis).

PPM1 <- ppm(starbucks.km ~ pop.lg.km)
PPM1

## Nonstationary Poisson process
## 
## Log intensity:  ~pop.lg.km
## 
## Fitted trend coefficients:
## (Intercept)   pop.lg.km 
##  -13.380761    1.231207 
## 
##               Estimate       S.E.    CI95.lo   CI95.hi Ztest      Zval
## (Intercept) -13.380761 0.47393184 -14.309650 -12.45187   *** -28.23351
## pop.lg.km     1.231207 0.05706877   1.119354   1.34306   ***  21.57409
## Problem:
##  Values of the covariate 'pop.lg.km' were NA or undefined at 41% (497 out 
## of 1199) of the quadrature points
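One way to read the fitted coefficients is to back-transform the log-linear trend. The sketch below copies the two estimates from the output above; the helper lambda.hat() and the population density value of 100 are hypothetical and chosen only to illustrate the ratio.

```r
# Coefficients copied from the fitted-model output above
b0 <- -13.380761   # intercept
b1 <-   1.231207   # slope on logged population density

# Log-linear model: log(lambda) = b0 + b1 * log(pop),
# so lambda = exp(b0) * pop^b1
lambda.hat <- function(pop) exp(b0 + b1 * log(pop))

# Ratio of intensities when population density doubles
# (the starting value of 100 is arbitrary; the ratio does not depend on it)
ratio <- lambda.hat(2 * 100) / lambda.hat(100)
ratio   # equals 2^b1, roughly 2.35
```

Under this model, doubling population density is associated with a little more than a doubling of the expected store intensity.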

Next, fit another model that assumes that the process’ intensity is not a function of population density (the null hypothesis).

PPM0 <- ppm(starbucks.km ~ 1)
PPM0

## Stationary Poisson process
## Intensity: 0.004583629
##              Estimate       S.E.   CI95.lo   CI95.hi Ztest      Zval
## log(lambda) -5.385264 0.07647191 -5.535146 -5.235382   *** -70.42147
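The Estimate in the table is on the log scale. As a quick check, exponentiating it recovers the Intensity line of the same output; the variable name log.lambda is hypothetical.

```r
log.lambda <- -5.385264   # Estimate for log(lambda), copied from the output above
exp(log.lambda)           # back-transformed intensity, about 0.004584
```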

For a stationary Poisson model, the fitted intensity λ should equal the observed density of Starbucks stores within the state of Massachusetts, that is, the number of stores divided by the area of the state.

starbucks.km$n / area(ma.km)

## [1] 0.008268627

Conduct the likelihood ratio test to see if the alternative model performs better than the null model.
The p-value is the probability of observing a deviance at least this large if the null model were true, and it is very small here.
That is, the alternative model (that the logged population density can help explain the distribution of Starbucks stores) is a significant improvement over the null.

anova(PPM0, PPM1, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: ~1   Poisson
## Model 2: ~pop.lg.km   Poisson
##   Npar Df Deviance  Pr(>Chi)    
## 1  498                          
## 2  499  1   492.28 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
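The reported p-value can be checked by hand from the deviance and degrees of freedom in the table above, using the chi-squared distribution. The variable names dev.lrt, df.lrt, and p.lrt are hypothetical.

```r
dev.lrt <- 492.28   # deviance, copied from the analysis of deviance table above
df.lrt  <- 1        # one extra parameter in the alternative model

# Upper-tail probability of a chi-squared variable with df.lrt degrees of freedom
p.lrt <- pchisq(dev.lrt, df = df.lrt, lower.tail = FALSE)
p.lrt < 2.2e-16     # TRUE, consistent with the "< 2.2e-16" shown in the table
```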