Lecture Notes

Exploratory Spatial Data Analysis

What is ESDA? In exploratory data analysis, we are looking for trends and patterns in data. We are working under the assumption that the more one knows about the data, the more effectively it may be used to develop, test and refine theory. This generally requires we follow two principles: skepticism and openness.

One should be skeptical of measures that summarize data, since they can conceal or misrepresent the most informative aspects of the data.

We must be open to patterns in the data that we did not expect to find, because these can often be the most revealing outcomes of the analysis.

We must avoid the temptation to automatically jump to the confirmatory model of data analysis.

The confirmatory model asks: Do my data confirm the hypothesis that X causes Y?
We would normally fit a linear model (or something close to it) and use summary measures (means and variances) to test if the pattern we observe in the data is real.

The exploratory model instead asks: What do the data I have tell me about the relationship between X and Y? This admits a much wider range of alternative explanations.

The principle of EDA is best explained by the simple model: data = smooth + rough. The smooth is the underlying simplified structure of a set of observations. This is the straight line we expect a relationship to follow in linear regression, for example, or the smooth curve describing the distribution of the data: the pattern or regularity in the data.

You can also think of the smooth as a central parameter of a distribution, the mean or median for example.

The rough is the difference between our actual data, and the smooth. This can be measured as the deviation of each point from the mean.
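As a tiny illustration of the data = smooth + rough idea, here is a sketch in R using a made-up vector of values (hypothetical numbers, not the San Antonio data used later):

x <- c(2, 3, 3, 4, 5, 25)      #a made-up variable; the 25 is deliberately extreme
smooth <- mean(x)              #the smooth: a single summary of the whole distribution
rough <- x - smooth            #the rough: each observation's deviation from the smooth
rough
x[which.max(abs(rough))]       #the largest absolute deviation flags the potential outlier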

Outliers, for example, are very rough data points: they have a large difference from the central tendency of the distribution, where all the other data points tend to clump.

Spatial data have these smooth and rough properties, but they also have two others: the first- and second-order arrangement characteristics of their attribute values, which equate to spatial trend and spatial autocorrelation, respectively.

We can also refer to these ideas in spatial data as global (smooth) versus local (rough) properties of the data.

We will use new tools (namely GeoDa) to do some Spatial EDA.

The way GeoDa differs from traditional ways of viewing data (on paper, or with summary statistics), is that it goes beyond presenting the data, to visualizing the data.

Dynamic data visualization is a process that incorporates multiple linked views of the same information, i.e. we can examine a histogram that is linked to a scatter plot that is linked to a map, to visualize the distribution of a single variable, how it is related to another variable, and how it is patterned over space.

We can also do what is called brushing the data: we can select subsets of the information and see if there are distinct subsets of the data that have different relational or spatial properties. This allows us to really visualize how the processes we study unfold over space, and to see the locations of potentially influential observations.

Examples of (E)SDA Items

Histograms Histograms are useful tools for representing the distribution of data. They can be used to judge central tendency (mean, median, mode), variation (standard deviation, variance), modality (unimodal or multi-modal), and to graphically assess distributional assumptions (e.g., Normality).

library(ggplot2)  #qplot() is part of ggplot2
qplot(dat$ppersonspo, geom="histogram", binwidth=.05)

Distribution of the poverty rate in census tracts in San Antonio, TX.

Boxplots Box plots (or box and whisker plots) are another useful tool for examining the distribution of a variable. They let you visualize the five-number summary of a variable: minimum, lower quartile, median, upper quartile, and maximum.

ggplot(dat@data, aes(x=decade, y=PHISPANIC))+ geom_boxplot()

Boxplot of %Hispanic in a census tract, by median year homes were built in San Antonio, TX

Scatter plots Scatterplots show bivariate relationships. They give you a visual indication of the association between two variables: a positive association (positive correlation) or a negative association (negative correlation). They also allow you to see potential outliers (abnormal observations) in the data.

qplot(y = dat$ppersonspo, x=dat$PHISPANIC)

Scatterplot of %Hispanic in a census tract, by % in poverty, in San Antonio, TX

Parallel coordinate plots Parallel coordinate plots allow for visualization of the association between multiple variables (multivariate). Each variable is plotted according to its “coordinates”, or values.

library(GGally)  #ggparcoord() is part of GGally
ggparcoord(data=dat@data, columns = c("ppersonspo", "PHISPANIC", "PWHITE"), groupColumn = "MEDHHINC", order = "allClass", scale="uniminmax")

Parallel coordinate plot of %in poverty, %White and %Hispanic, in San Antonio, TX, shaded by the median household income of the tract.

Area statistics/Local statistics

Map of the raw poverty rate in San Antonio, TX, from the ACS 2009 5-year data release.

Map of the locally averaged poverty rate in San Antonio, TX, from the ACS 2009 5-year data release.

Local statistics allow you to see how the mean, or variation, of a variable varies over space. You can generate a local statistic by weighting an ordinary statistic by some kind of spatial weight.

Global and Local Statistics By global we imply that one statistic adequately summarizes the data, e.g. the mean or median, or a regression model that is suitable for all areas in the data. Local statistics are useful when the process you are studying varies over space, i.e. different areas have different local values that might cluster together to form a local deviation from the overall mean, or a regression model that accounts for variation in the variables over space. We often call this spatial heterogeneity, meaning the process that generates the data varies over space. This is also called a spatial regime.

Stationarity Stationarity simply means that the process is not changing with respect to either time (i.e. time series analysis) or space.
This implies that the process that has generated our data is acting the same way in all areas under study.
The implications of stationarity are that we can use a global statistic to measure our process and not feel too bad about it. It also implies that our observations are iid (independent and identically distributed) with respect to one another: e.g., the parameters estimated by the regression of Y on X are the same throughout our area of study and do not have a tendency to change. It also means the estimated model is equally well specified at all locations. This is our general assumption in regression models.

Non-Stationarity If a process is non-stationary then the process changes with respect to time or space.
This implies that the process that has generated our data is not acting the same way in all areas, or the expected value (mean, or variance) of our data are subject to spatial fluctuations.
If our data are subject to such fluctuations, then this implies that our global statistics are also subject to major local fluctuations, meaning areas in our data can tend to cluster together and have similar values.

Autocorrelation This can occur in either space or time, and really boils down to non-independence between neighboring values. The values of our independent variable (or our dependent variable) may be similar because they occur closely in time (temporal autocorrelation) or closely in space (spatial autocorrelation).

Basic Assessment of Spatial Dependency Before we can model the dependency in spatial data, we must first cover the ideas of creating and modeling neighborhoods in our data. By neighborhoods, I mean the clustering or connectedness of observations. The exploratory methods we will cover today depend on us knowing how our data are arranged in space: who is next to whom. This is important (as we will see later) because most correlation in spatial data tends to die out as we get further away from a specific location (Tobler’s law).

Waldo Tobler (1970) jokingly suggested the “first law of geography”: “Everything is related to everything else, but near things are more related than distant things.”

We can see this better in graphical form: We expect the correlation between the attributes of two points to diminish as the distance between them grows

Basic clustering Clustering means that observations that are close geographically are also close in other attributes. Autocorrelation is typically a local process, meaning it typically dies out as the distance between observations increases.

So statistics that correct for, or in fact measure, spatial association have to account for where we are with respect to the observation under present consideration. This is typically done by specifying the spatial connectivity between observations. Spatial connectivity, or a spatial neighborhood, is defined based on the interactions/associations between features in our data. This connectivity is often expressed as the spatial weight of an observation: in other words, how much of the value of a surrounding observation we consider when we are looking at spatial correlation.
Typically the weight of a neighboring observation dies out the further it is from our feature of interest.

There are two typical ways in which we measure spatial relationships: distance and contiguity.

Distance based association In a distance-based connectivity method, features (generally points) are considered to be contiguous if they are within a given radius of another point. The radius is left up to the researcher to decide. For example, we did this in the point analysis lab, where we selected roads within a mile of hospitals. We could equally search for other hospitals within a given radius of every hospital; these would then be labeled as neighbors according to our radius rule. Likewise, we can calculate the distance matrix between a set of points. This is usually measured using the standard Euclidean distance:

\(d=\sqrt{(x_1-x_2)^2 + (y_1 - y_2)^2}\)

where x and y are the coordinates of the points (or polygon centroids) in question (the selected features); this is the “as the crow flies” distance. There are many other distance metrics as well.
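As a quick sketch, here is the Euclidean distance computed by hand for two made-up points, and the full pairwise distance matrix for a small set of hypothetical coordinates using base R’s dist():

p1 <- c(0, 0)
p2 <- c(3, 4)
sqrt(sum((p1 - p2)^2))         #as-the-crow-flies distance: 5

xy <- rbind(c(0, 0), c(3, 4), c(6, 8))
dist(xy)                       #pairwise distances among all the points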

Spatial Neighbors There are many different criteria for deciding if two observations are neighbors. Generally, two observations must be within a critical distance, d, to be considered neighbors. This is the minimum distance criterion, and it is very popular. It generates a matrix of binary variables describing the neighborhood. We can also describe the neighborhoods with a continuous weighting scheme based on the distance between observations:

Inverse Distance Weight \(w_{ij} = \frac{1}{d_{ij}}\) or Inverse-Squared Distance Weight \(w_{ij} = \frac{1}{d_{ij}^2}\)
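Below is a hedged sketch of building a distance-band neighbor set with inverse-distance weights in spdep. It assumes a projected SpatialPolygonsDataFrame like the dat object loaded in the empirical example at the end of these notes; the 5,000 meter radius is an arbitrary choice for illustration.

library(spdep)
coords <- coordinates(dat)                  #polygon centroids
nb.d <- dnearneigh(coords, d1=0, d2=5000)   #neighbors within 5 km (projection units are meters)
dists <- nbdists(nb.d, coords)              #distance to each neighbor
idw <- lapply(dists, function(d) 1/d)       #inverse-distance weights, 1/d_ij
lw.idw <- nb2listw(nb.d, glist=idw, style="W", zero.policy=TRUE)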

K nearest neighbors A useful way to use distances is to construct a k-nearest neighbors set. This finds the “k” closest observations for each observation, where k is some integer. For instance, if we find the k = 3 nearest neighbors, then each observation will have 3 neighbors, which are the closest observations to it regardless of the distance between them. This is important: using the k nearest neighbor rule, two observations could potentially be very far apart and still be considered neighbors.

Adjacency weights If the other feature is within the threshold distance (or shares a boundary), a 1 is given; otherwise a 0 is given. Neighborhoods are created based on which observations are judged “contiguous”.

This is generally the best way to treat polygon features. Polygons are contiguous if they share common topology, like an edge (line segment) or a vertex (point). Think of a chess board: Rook adjacency: neighbors must share a line segment.

Queen adjacency: neighbors must share a vertex or a line segment. If polygons share these boundaries (based on the specific definition: rook or queen), they are given a weight of 1 (adjacent); otherwise they are given a value of 0 (nonadjacent).


What does a spatial weight matrix look like? Assume our data consist of four polygons arranged in a 2×2 grid, numbered 1 through 4.

This would be a Rook-based adjacency weight matrix:

\[ w_{ij} = \begin{bmatrix} 0 & 1 & 1 & 0\\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0& 1\\ 0& 1& 1 & 0 \end{bmatrix} \]

\(w_{ij}\) = 1 if polygons share a border, 0 if they don’t. Also note that an observation can’t be its own neighbor, so the diagonal elements of the matrix are all 0.

Measuring Spatial Autocorrelation If we observe data Z(s) (an attribute) at location i, and again at location j, the spatial autocorrelation between \(Z(s)_i\) and \(Z(s)_j\) is the degree of similarity between them, measured as the standardized covariance between their locations and values. In the absence of spatial autocorrelation, the locations of \(Z(s)_i\) and \(Z(s)_j\) have nothing to do with their values. On the other hand, if autocorrelation is present, close proximity of \(Z(s)_i\) and \(Z(s)_j\) leads to close values of their attributes.

Positive autocorrelation This means that a feature is positively associated with the values of the surrounding area (as defined by the spatial weight matrix): high values occur with high values, and low with low.

Negative autocorrelation This means that a feature is negatively associated with the values of the surrounding area (as defined by the spatial weight matrix): high with low, and low with high. The (probably) most popular global autocorrelation statistic is Moran’s I:

\(I = \frac{n}{(n - 1)\sigma^2 w_{..}} \sum_{i=1}^n \sum_{j=1}^n w_{ij} (Z(s_i) - \bar Z)(Z(s_j) - \bar Z)\)

with \(Z(s_i)\) being the value of the attribute at location i, \(Z(s_j)\) being the value of the attribute at location j, \(\sigma^2\) the sample variance, \(w_{ij}\) the weight for the pair of locations i and j (0 if they are not neighbors, 1 otherwise), and \(w_{..}\) the sum of all the weights.
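To connect the formula to code, here is a hedged sketch that computes Moran’s I “by hand”. It assumes the row-standardized weight matrix wmat and the mortality rate constructed in the empirical example at the end of these notes, and should reproduce the statistic that moran.test() reports there.

z <- dat$mortrate
zbar <- mean(z)
n <- length(z)
num <- sum(wmat * outer(z - zbar, z - zbar))   #double sum of w_ij (z_i - zbar)(z_j - zbar)
den <- sum((z - zbar)^2)                       #(n - 1) times the sample variance
W <- sum(wmat)                                 #w.. , the sum of all the weights
(n / (den * W)) * num                          #Moran's I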

If I is greater than E(I), there is global positive autocorrelation; if I is less than E(I), there is global negative autocorrelation, where E(I) is the expected value of I under no spatial autocorrelation:

\(E(I)=-\frac{1}{n-1}\)
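For example, with the n = 234 San Antonio census tracts used in the empirical example below:

n <- 234
-1/(n - 1)    #-0.004291845, the Expectation reported by moran.test() below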

Moran’s I Scatterplot It is sometimes useful to visualize the relationship between the actual values of the dependent variable and their spatially lagged values. This is the so-called Moran scatterplot. Lagged values are the average value of the surrounding neighborhood around location i: lag\((z_i) = \sum_j w_{ij} z_j\), or \(z'W\) in matrix terms. The Moran scatterplot shows the association between the observation of interest and its neighborhood’s average value. The variables are generally plotted as z-scores, to avoid scaling issues.

Moran Scatter plot for San Antonio Poverty Rates

Here is the overall poverty rate map for San Antonio; the global Moran’s I was 0.655.

And here we show the Moran scatterplot:

#Moran Scatter plot
moran.plot(as.numeric(scale(dat$ppersonspo)), listw=salw, xlab="Poverty Rate", ylab="Lagged Poverty Rate", main=c("Moran Scatterplot for Poverty", "in San Antonio. I=.655") )

Moran’s I is basically a correlation (think Pearson’s \(\rho\)) between a variable and a spatially lagged version of itself. Spatial lags of a variable are computed by multiplying the variable through the spatial weight matrix for the data.

If we have a value \(Z(s_i)\) at location i and a spatial weight matrix \(w_{ij}\) describing the spatial neighborhood around location i, we can find the lagged value of the variable by: \(WZ_i = \sum_j w_{ij} Z(s_j)\)

This calculates what is, effectively, the neighborhood average value in the locations around location i, often written \(Z(s_{-i})\).

Again, take the rook-based adjacency weight matrix from above:

\[ w_{ij} = \begin{bmatrix} 0 & 1 & 1 & 0\\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0& 1\\ 0& 1& 1 & 0 \end{bmatrix} \]

Typically this matrix is standardized by dividing each element of \(w_{ij}\) by its row’s number of neighbors; this is called a row-standardized matrix:

\[ w_{ij} = \begin{bmatrix} 0 & .5 & .5 & 0\\ .5 & 0 & 0 & .5 \\ .5 & 0 & 0& .5\\ 0& .5& .5 & 0 \end{bmatrix} \]

and a variable z, equal to:

\[z=\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\]

When we form the product: \(z'W\), we get:

\[z_{lag}=\begin{bmatrix} 2.5 & 2.5 & 2.5 & 2.5 \end{bmatrix}\]

This is where the y-value of the Moran scatterplot comes from: it is just the spatially lagged version of the original variable. (In this tiny example every observation has the same neighborhood average, 2.5, because each observation’s two neighbors happen to sum to 5.)
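The same arithmetic in R, as a direct transcription of the small example above:

W <- matrix(c(0,1,1,0,
              1,0,0,1,
              1,0,0,1,
              0,1,1,0), nrow=4, byrow=TRUE)
Wstd <- W / rowSums(W)     #row-standardize: divide each row by its number of neighbors
z <- c(1, 2, 3, 4)
as.numeric(z %*% Wstd)     #the lagged values; this W is symmetric, so z'W equals Wz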

Bivariate Moran Plots The so-called bivariate Moran statistic is just the correlation of a spatially lagged variable, y, with another variable, x. An example from San Antonio is a bivariate Moran analysis of poverty and crime rates.

## [1] 0.1283409

Which is a pretty low correlation (I = .128). This shows slight positive spatial association, meaning that areas with higher crime rates tend to surround areas with higher poverty rates.
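Here is a hedged sketch of how such a bivariate measure could be computed, assuming the dat object and the row-standardized salw weights from the empirical example; crimerate is a hypothetical column name standing in for the crime variable actually used.

x <- as.numeric(scale(dat$ppersonspo))   #poverty rate, standardized
y <- as.numeric(scale(dat$crimerate))    #crime rate, standardized (hypothetical column name)
lag.y <- lag.listw(salw, y)              #neighborhood average of standardized crime
cor(x, lag.y)                            #correlation of x with the spatial lag of y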

Local Moran’s I So far, we have only seen a global statistic for autocorrelation, which tells us if there is any overall clustering in our data. We may be more interested in where the autocorrelation occurs, or where clusters are located. A local version of the Moran statistic is available as well. It calculates the Moran statistic from above, but only for the local neighborhood: it compares the observation’s value to the local neighborhood average, instead of the global average. Anselin (1995) referred to this as a “LISA” statistic, for Local Indicator of Spatial Autocorrelation.

Here is a LISA map for clusters of poverty in San Antonio:

which shows areas of low-poverty clustering in blue and high-poverty clustering in red. These are so-called clusters because they are areas with higher (or lower, for the blues) than average poverty rates, surrounded by areas that also have higher (or lower) than average poverty rates. The red clusters are so-called “high-high clusters”; likewise, the blue areas are called “low-low clusters”. We also see light pink and light blue polygons. The light pink polygons represent areas that have high poverty rates but sit in a low-poverty spatial neighborhood, and are called high-low outliers. The one in the northwest part of San Antonio is a good example, because it’s the UTSA main campus, a student ghetto, if you will.

Likewise, the light blue polygons are called low-high outliers, because they have low poverty rates but sit in a high-poverty spatial neighborhood.

What these methods tell you

* Moran’s I is a descriptive statistic ONLY
* It simply indicates if there is spatial association/autocorrelation in a variable
* Local Moran’s I tells you if there is significant localized clustering of the variable

Empirical Example

GeoDa is very adept at doing this stuff, but alas, it’s point and click. Coding means you have something to come back to when you want to do the same job again, meaning your research is reproducible. What if you click the wrong button next time?

First we load some libraries we need and the data, which is stored in an ESRI shapefile. All data can be found on Github.

library(spdep)
library(maptools)
library(RColorBrewer)
library(classInt)
setwd("~/Google Drive/dem7263/data/")
dat<-readShapePoly("SA_classdata.shp", proj4string=CRS("+proj=utm +zone=14 +north"))

#create a mortality rate, 3 year average
dat$mort3<-apply(dat@data[, c("deaths09", "deaths10", "deaths11")],1,sum)
dat$mortrate<-1000*dat$mort3/dat$acs_poptot

#get rid of tracts with very low populations
dat<-dat[which(dat$acs_poptot>100),]

#Show a basic quantile map of the mortality rate
spplot(dat, "mortrate",
       main="Mortality Rate, 3 Year Avg, 2009-2011",
       at=quantile(dat$mortrate), 
       col.regions=brewer.pal(n=5, "Blues"))

#Make a rook style weight matrix
sanb<-poly2nb(dat, queen=F)
summary(sanb)
## Neighbour list object:
## Number of regions: 234 
## Number of nonzero links: 1094 
## Percentage nonzero weights: 1.997955 
## Average number of links: 4.675214 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 
##  4 11 29 63 68 31 23  3  2 
## 4 least connected regions:
## 61 82 147 205 with 1 link
## 2 most connected regions:
## 31 55 with 9 links
salw<-nb2listw(sanb, style="W")
#the style = W row-standardizes the matrix

#make a k-nearest neighbor weight set, with k=3
knn<-knearneigh(x = coordinates(dat), k = 3)
knn3<-knn2nb(knn = knn)

#Visualize the neighbors
plot(dat, main="Rook Neighbors")
plot(salw, coords=coordinates(dat), add=T, col=2)

plot(dat, main="k=3 Neighbors")
plot(knn3, coords=coordinates(dat), add=T, col=2)

#generate spatially lagged (locally averaged) values: Wz, the neighborhood average
wmat<-nb2mat(sanb, style="W")
dat$mort.w<-as.numeric(wmat%*%dat$mortrate)

#plot the raw mortality rate and the spatially averaged rate.
spplot(dat, c("mortrate", "mort.w"),
       at=quantile(dat$mortrate), 
       col.regions=brewer.pal(n=5, "Blues"),
       main="Mortality Rate, 3 Year Avg, 2009-2011")

#Calculate univariate moran's I for the mortality rate
moran.test(dat$mortrate, listw=salw)
## 
##  Moran's I test under randomisation
## 
## data:  dat$mortrate  
## weights: salw  
## 
## Moran I statistic standard deviate = 10.8154, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.472820119      -0.004291845       0.001946053
moran.mc(dat$mortrate, listw=salw, nsim=999)
## 
##  Monte-Carlo simulation of Moran's I
## 
## data:  dat$mortrate 
## weights: salw  
## number of simulations + 1: 1000 
## 
## statistic = 0.4728, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
#Moran Scatter plot
moran.plot(dat$mortrate, listw=salw)

#Local Moran Analysis
locali<-localmoran(dat$mortrate, salw, p.adjust.method="fdr")
dat$locali<-locali[,1]
dat$localp<-locali[,5]

#We have to make our own cluster identifiers
dat$cl<-as.factor(ifelse(dat$localp<=.05,"Clustered","NotClustered"))

#Plots of the results
spplot(dat, "locali", main="Local Moran's I", at=quantile(dat$locali), col.regions=brewer.pal(n=4, "RdBu"))

spplot(dat, "cl", main="Local Moran Clusters", col.regions=c(2,0))

#Local Getis-Ord G analysis
localg<-localG(dat$mortrate, salw)
dat$localg<-as.numeric(localg)

#Plots
spplot(dat, "localg", main="Local Geary's G", at=c(-4, -3,-2,-1,0,1,2,3, 4), col.regions=brewer.pal(n=8, "RdBu"))