Conceptual Stuff

What’s special about spatial?

  • Spatial data have more information than ordinary data

  • Think of them as a triplet - Y, X, and Z, where Y is the variable of interest, X is some other information that influences Y and Z is the geographic location where Y occurred

  • If our data aren’t spatial, we don’t have Z
  • Spatial information is a key attribute of behavioral data
  • This adds a potentially interesting attribute to any data we collect


  • Spatial data monkey with models

  • Most analytical models have assumptions, spatial structure can violate these models

  • We typically want to jump into modeling, but without acknowledging or handling directly, spatial data can make our models meaningless

  • Some of the problems are..


The Ecological fallacy

  • The tendency for aggregate data on a concept to show correlations when individual data on a concept do not.

  • In general the effect of aggregation bias, whereby those studying macro-level data try to make conclusions or statements about individual-level behavior

  • This also is felt when you analyze data at a specific level, say counties, your results are only generalizeable at that level, not at the level of congressional districts, MSA’s or states.

  • The often-arbitrary nature of aggregate units also needs to be considered in such analysis.


MAUP

  • This is akin to the ecological fallacy and the notion of aggregation bias.

-The MAUP occurs when inferences about data change when the spatial scale of observation is modified.

  • i.e. at a county level there may be a significant association between income and health, but at the state or national level this may become insignificant, likewise at the individual level we may see the relationship disappear.

-This problem also exists when we suspect that a characteristic of an aggregate unit is influencing an individual behavior, but because the level at which aggregate data are available, we are unable to properly measure the variable at the aggregate level.

-E.g. we suspect that neighborhood crime rates will the recidivism hazard for a parolee, but we can only get crime rates at the census tract or county level, so we cannot really measure the effect we want.

Spatial Structure

Spatial Heterogeneity

Stationarity

Non Stationarity

Autocorrelation

Basic Assessment of Spatial Dependency

Tobler’s Law

Basic Spatial clustering

Spatial Connectivity

Example of San Antonio

library(tidycensus);  library(tidyverse)
## -- Attaching packages --------
## v ggplot2 2.2.1.9000     v purrr   0.2.4     
## v tibble  1.4.2          v dplyr   0.7.5     
## v tidyr   0.8.1          v stringr 1.3.1     
## v readr   1.1.1          v forcats 0.3.0
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(spdep)
## Loading required package: sp
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source'))`
sa_acs<-get_acs(geography = "tract", state="TX", county = c("Bexar"), year = 2015,
                variables=c("DP05_0001E", "DP03_0009P", "DP03_0062E", "DP03_0119PE",
                            "DP05_0001E","DP02_0009PE","DP02_0008PE", "DP02_0040E","DP02_0038E",
                            "DP02_0066PE","DP02_0067PE","DP02_0080PE","DP02_0092PE",
                            "DP03_0005PE","DP03_0028PE","DP03_0062E","DP03_0099PE","DP03_0101PE",
                            "DP03_0119PE","DP04_0046PE","DP04_0078PE","DP05_0072PE","DP05_0073PE",
                            "DP05_0066PE", "DP05_0072PE", "DP02_0113PE") ,
                geometry = T, output = "wide")
## Getting data from the 2011-2015 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using the ACS Data Profile
## Using the ACS Data Profile
sa_acs$county<-substr(sa_acs$GEOID, 1, 5)

sa_acs2<-sa_acs%>%
  mutate(totpop= DP05_0001E, fertrate = DP02_0040E,pwhite=DP05_0072PE, 
         pblack=DP05_0073PE , phisp=DP05_0066PE, pfemhh=DP02_0008PE,
         phsormore=DP02_0066PE,punemp=DP03_0009PE, medhhinc=DP03_0062E,
         ppov=DP03_0119PE, pforn=DP02_0092PE,plep=DP02_0113PE) %>%
  filter(complete.cases(totpop, ppov))

Spatial Neighbors

Contiguity based neighbors

In a general sense, we can think of a square grid. Cells that share common elements of their geometry are said to be “neighbors”. There are several ways to describe these patterns, and for polygons, we generally use the rules of the chess board.

Rook adjacency Neighbors must share a line segment

Queen adjacency Neighbors must share a vertex or a line segment

If polygons share these boundaries (based on the specific definition: rook or queen), they are said to be “spatial neighbors” of one another. The figure below illustrates this principle.

For an observation of interest, the pink area, the Rood adjacent areas are those in green in the figure, becuase they share a line segment. For the second part of the figure on the right, the pink area has different sets of neigbors, compared to the Rook rule neighbors, becuase the area also shares vertices with other polygons, making them Queen neighbors.

Adjacency using Chessboard Rules

Adjacency using Chessboard Rules

Order of adjacency

The figure above also highlights the order of adjacency among observations. By order of adjacency, we simply men that observations are either immediate neighbors (the green areas), or they are neigbors of immediate neighbors. These are referred to as first and second order neighbors.

So, we can see, that the yellow polygons are the neighboring areas for this tract, which allows us to think about what the spatial structure of the area surrounding this part of campus.

For an example, let’s consider the case of San Antonio again. If our data are polygons, then there is a function in the spdep library in R, poly2nb that will take a polygon layer and find the neighbors of all areas using either a queen or rook rule. First we form the neighbors using the rook rule for all the tracts in Bexar County.

library(spdep)
nbsR<-poly2nb(as(sa_acs2, "Spatial"), queen=F, row.names = sa_acs2$GEOID)

plot(as(sa_acs2, "Spatial"))
plot(nbsR, coords= coordinates(as(sa_acs2, "Spatial")), add= T, col="red", cex=.75, pch='.')

This is a very complicated structure of neighbors because of the dense collection of tracts.

Distance based association

The queen and rook rules are useful for polygon features, but distance based contiguity is useful for all feature types (points, polygons, lines). The idea is similar to the polygon adjacency rule from above, but the distance rule is based on the calculated distance between areas. There are a variety of distance metrics that are used in statistics, but the most commonly assumed one is the Euclidean distance. The Euclidean distance between any two points is:

\[D^2 = \sqrt{\left (x_1 - x_2 \right)^2 + \left (y_1 - y_2 \right)^2 } \] Where x and y are the coordinates of each of the two areas. For polygons, these coordinates are typically the centroid of the polygon (you may have noticed this above when we were plotting the neighbor lists), while for point features, these are the two dimensional geometry of the feature. The collection of these distances between all features forms what is known as the distance matrix between observations. This summarizes all distances between all features in the data. Some analytical routines work with these continuous distances, but most of the spatial anlaysis routines we are going to cover in this course dichotomize the distances based on a critical distance rule:

\[d_{ij} = \begin{cases} & \text{ if } D_{ij}^2 \leqslant h \text{ then } d_{ij}=1 \\ & \text{ if } D_{ij}^2 > h \text{ then } d_{ij}=0 \end{cases}\]

Which is a simple distance rule, where if the distance between any two points is less than or equal to a threshold distance \(h\), then the two are considered to be neighbors (\(d_{ij}\) = 1), while if any two observations are further apart than \(h\), they are not considered to be neighbors.

K nearest neighbors

Measuring Spatial Autocorrelation

Types of autocorrelation

Positive Autocorrelation - This means that a feature is positively associated with the values of the surrounding area (as defined by the spatial weight matrix), high values occur with high values, and low with low

Negative autocorrelation - This means that a feature is negatively associated with the values of the surrounding area (as defined by the spatial weight matrix), high with low, low with high

Measures of autocorrelation

Geary’s C

Getis-Ord G

Spatial Lag of a Variable


\[ w_{ij} = \begin{bmatrix} 0 & 1 & 1 & 0\\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0& 1\\ 0& 1& 1 & 0 \end{bmatrix} \]


Typically this matrix is standardized, by dividing each element of \(w_{ij}\) by the number of neighbors, this is called row-standardized:

\[ w_{ij} = \begin{bmatrix} 0 & .5 & .5 & 0\\ .5 & 0 & 0 & .5 \\ .5 & 0 & 0& .5\\ 0& .5& .5 & 0 \end{bmatrix} \]


and a variable z, equal to:

\[z=\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\]

When we form the product: \(z'W\), we get:

\[z_{lag}=\begin{bmatrix} 2.5 & 2.5 & 2.5 & 2.5 \end{bmatrix}\]

In R, we can simply do:

z<-c(1,2,3,4)
w<-matrix(c(0,.5,.5,0,.5,0,0,.5,.5,0,0,.5,0,.5,.5,0), nrow = 4, byrow = T)
z
## [1] 1 2 3 4
w
##      [,1] [,2] [,3] [,4]
## [1,]  0.0  0.5  0.5  0.0
## [2,]  0.5  0.0  0.0  0.5
## [3,]  0.5  0.0  0.0  0.5
## [4,]  0.0  0.5  0.5  0.0
z%*%w
##      [,1] [,2] [,3] [,4]
## [1,]  2.5  2.5  2.5  2.5

is the spatially lagged value of z.

San Antonio Poverty Rate Map

Here is the overall poverty rate map for San Antonio

library(classInt)
sa_acs2%>%
  mutate(povcut=cut(sa_acs2$ppov,breaks=data.frame(classIntervals(var=sa_acs2$ppov, n=6, style="jenks")[2])[,1], include.lowest = T))%>%
  ggplot(aes(color=povcut, fill=povcut))+geom_sf()+
  scale_fill_brewer(palette = "Blues") + 
  scale_color_brewer(palette = "Blues")

Moran’s I Statistic

One of the most popular global autocorrelation statistic is Moran’s I [@Moran1950]

\(I = \frac{n}{(n - 1)\sigma^2 w_{..}} \sum^i_n \sum^j_n w_{ij} (Z(s_i) - \bar Z)(Z(s_j) - \bar Z)\)

with \(Z(s_i)\) being the value of the variable, the poverty rate for example, at location i, \(Z(s_j)\) being the value of the poverty rate at location j, \(\sigma^2\) is sample variance of the poverty rate, \(w_{ij}\) is the weight for location ij (0 if they are not neighbors, 1 otherwise).

Moran’s I is basically a correlation, think of Pearson’s \(\rho\) between a variable and a spatially lagged version of itself.

library(spdep)
library(RColorBrewer)

#Queen Spatial weight matrix, row standardized
nbs<-poly2nb(as(sa_acs2,"Spatial"), queen = T)
nbs
wts<-nb2listw(neighbours = nbs, style = "W")
moran.test(sa_acs2$ppov, listw = wts)

Moran’s I Scatterplot It is sometimes useful to visualize the relationship between the actual values of a variable and its spatially lagged values. This is the so called Moran scatterplot

Lagged values are the average value of the surrounding neighborhood around location i

lag(Z) = \(z_{ij} * w_{ij}\) = \(z'W\) in matrix terms

The Moran scatterplot shows the association between the value of a variable in a location and its spatial neighborhood’s average value. The variables are generally plotted as z-scores,to avoid scaling issues.

And here we show the Moran scatterplot: #Moran Scatter plot

sa_acs2$zpov<-as.numeric(scale(sa_acs2$ppov))
moran.plot(x = sa_acs2$zpov, listw = wts, labels = F, xlab = "Poverty Rate in Tracts- z score", ylab="Lagged Poverty Rate in Tracts - z score", main="Moran Scatterplot of Income in Bexar County Census Tracts")

moran.test(x=sa_acs2$ppov, listw = wts)
## 
##  Moran I test under randomisation
## 
## data:  sa_acs2$ppov  
## weights: wts    
## 
## Moran I statistic standard deviate = 18.344, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5417209905     -0.0027548209      0.0008810032

Other autocorrelation measures

geary.test(sa_acs2$ppov, listw = wts)
## 
##  Geary C test under randomisation
## 
## data:  sa_acs2$ppov 
## weights: wts 
## 
## Geary C statistic standard deviate = 16.482, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##       0.456739455       1.000000000       0.001086398
globalG.test(sa_acs2$ppov, listw = wts)
## Warning in globalG.test(sa_acs2$ppov, listw = wts): Binary weights
## recommended (sepecially for distance bands)
## 
##  Getis-Ord global G statistic
## 
## data:  sa_acs2$ppov 
## weights: wts 
## 
## standard deviate = 16.036, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Global G statistic        Expectation           Variance 
##       3.731959e-03       2.754821e-03       3.712817e-09

Local Autocorrelation Statistics


Here is a LISA map for clusters of poverty in San Antonio:

which shows areas of low poverty clustering in blue, and high poverty clustering in red.


What these methods tell you