Markdown Author: Jessie Bell
Download this Rmd: Top right corner → Code → Download Rmd
birdRichnessMexico.rds contains points of bird richness
(# of species) at a bunch of points in Mexico. The column
nSpecies is the number of bird species occurring within a
km of that location.
Projection: North American Albers Equal Area Conic
Center: Mexico (SR-ORG:38)
Units: Kilometers (km)
Simple map of the data are shown below:
birds_sf <- readRDS("birdRichnessMexico.rds")
tmp_Mexico <- data.frame(st_coordinates(st_transform(birds_sf, crs=4326)), nSpecies=birds_sf$nSpecies) #started on Andy's code and then needed API key for google and have no time to mess.
tmap_mode("view")
tm_basemap(c(StreetMap = "OpenStreetMap",
TopoMap = "OpenTopoMap"))+
tm_shape(birds_sf) +
tm_symbols(col="nSpecies",alpha = 0.7, palette = rev(hcl.colors(5, "Magenta")))
In the map above, you can see that species richness seems to be
highest in southern Mexico, in the northern half – species richness is
greatest along the midwest, a bit off from the coast. After adding the
topo layer into the tmap, I noticed that species richness
increases in areas of lower elevation and nearest a mountain range.
In order to determine which points in
birdsRichnessMexico.rds are correlated, I will use the
sf simple features to plug into a cloud variogram
first.
birdVarCloud <- variogram(nSpecies~1, birds_sf, cloud=T)
plot(birdVarCloud, pch=20, cex=1.5, col="#ff9e9e", alpha=0.1, ylab=expression(Semivariance~(gamma)), xlab="Distance (km)", main = "Species Richness")
Figure 1: On the x-axis of the plot is the distance
between the locations, and on the y-axis is the difference of their
nSpecies values squared. That is, each dot in the
variogram represents a pair of locations, not the
individual locations on the map. This is a bit messy, the points from
about 40,000 and above are not spatially autocorrelated. To view this
more clearly, we can get an average model value for the points
(essentially binning them).
birdVar <- variogram(nSpecies~1, birds_sf, cloud = FALSE)
plot(birdVar,pch=20,cex=1.5,col="#8aff9c",
ylab=expression(Semivariance~(gamma)),
xlab="Distance (km)", main = "Bird Species Richness")
Figure 2: Average model values for autocorrelation of species richness. Here you can see that bird species richness in Mexico seems to be autocorrelated out to a distance of about 700 km, because γ levels off and is larger in number at that distance.
First, the test:
d <- as.matrix(dist(st_coordinates(birds_sf)))
# make an empty weights matrix of the right dimensions
w <- matrix(0,ncol=ncol(d),nrow=nrow(d))
# set some values to 1
w[d>500 & d<=1000] <- 1
# convert a the weights matrix to a weights list object
# so spdep is happy
wList <- mat2listw(w)
# calculate I
moran.test(birds_sf$nSpecies,wList)
##
## Moran I test under randomisation
##
## data: birds_sf$nSpecies
## weights: wList
##
## Moran I statistic standard deviate = 15.127, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 1.451094e-01 -5.025126e-03 9.851016e-05
For this, I am going to use the ncf library. I am going
to use the distance rule to determine how far out to go on this
graph:
birdX <- st_coordinates(birds_sf)[,1]
birdY <- st_coordinates(birds_sf)[,2]
birdRichness <- birds_sf$nSpecies
max(birdX) - min(birdX)
## [1] 2808.024
## [1] 2808.024
max(birdY) - min(birdY)
## [1] 1824.946
## [1] 1824.946
birdD <- dist(cbind(birdX,birdY))
max(birdD)/3
## [1] 1037.547
birdI <- correlog(x=birdX, y=birdY, z=birdRichness,
increment=100, resamp=100, quiet=TRUE)
The furthest point-to-point distance is about 3100 km, and 1/3 of that distance is about 1000 km. I will only interpret distances of 1000 km in the following graphs.
bird_df <- data.frame(n=birdI$n,
I=birdI$correlation,
d = birdI$mean.of.class,
p=birdI$p)
bird_df %>%
mutate(Significant = p < 0.1) %>%
ggplot() +
geom_hline(yintercept = 0,linetype="dashed") +
geom_path(aes(x = d, y = I, group = 1, color=Significant),size=1) +
geom_point(aes(x = d, y = I, fill=Significant),size=5,pch=21) +
lims(x=c(0,1000),y=c(-0.6,0.6)) +
scale_fill_manual(values = c("#fd4a09","#aed601")) +
scale_color_manual(values = c("#fd4a09","#aed601")) +
labs(x="Distance (km)",y="Moran's I",
title = "Autocorrelation of Bird Species",subtitle = "Critical value of p<0.01")+
theme_classic()
Figure 3: Correlogram for bird richness in Mexico. You can see that there seems to be significant spatial autocorrelation at distances of 1000 km.
ggplot(bird_df, aes(d, I))+
stat_density_2d(aes(fill = ..level..), geom = "polygon")+
scale_fill_distiller(palette= "Set2", direction=1)+
geom_point(bird_df, mapping=aes(x=d, y=I), alpha=0.7, color="#fd4a09")+
theme_classic()
Figure 4: Attempting to get a ggplotly like Andy’s and will revisit this if I have more time.
Question: What can you learn by comparing the approaches? Maybe more importantly what are we missing with this approach? Where does this understanding of pattern help us with process and where is it lacking?
Answer: I had to skip ahead and come back to this because I didn’t know how to answer it. I think that this approach does not include directionality and we could benefit from including it (as we did in the next part).
birdVar <- variogram(nSpecies~1, data = birds_sf, alpha=c(0, 90))
plot(birdVar)
Put on a biogeography hat and explain what might be happening here.
Answer: I think that the anisotropy (the fact that there is higher species richness on the southern half of Mexico) of the data is because of the rain shadow falling on the eastern side of the Sierra Madras. The increase in rain likely allows an increase in plant diversity as well as an increase in bird diversity. You can see above that points in southern part of Mexico (90°) have higher spatial autocorrelation as well.
Imagine we have two points with locations specified by x and y, we can find the distance between them using the Euclidian Distance:
\(D =\) \(\sqrt {(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
Now, calculate pairwise distances for a set of points:
x <- c(.45, .375, .45, .678, .55)
y <- c(.02, .87, .47, .9, .55)
dat <- data.frame(x,y)
distancematrix <- dist(dat)
# 1 2
#2 0.8533024
#3 0.4500000 0.4069705
#4 0.9090567 0.3044815
#5 0.5393515 0.3647259
# 3 4
#2
#3
#4 0.4867073
#5 0.1280625 0.3726714
Here you can see that there are \(\frac{n\times(n-1)}2\) pairwise elements of D. \((5\times\frac{4}2 = 10)\) So there are 10 pairs of values. Thus…the distance between the first and second points can be calculated as follows:
Now let’s add some attributes to those points!
data(meuse.all)
meuse_df <- data.frame(Variables=colnames(meuse.all))
gt(head(meuse_df))|>
opt_table_font()|>
opt_table_outline(style="solid")|>
tab_header(
title = "Meuse River Variables")|>
opt_stylize(style = 5, color = "cyan")|>
tab_options(table.align='center', table.font.size = px(10))
| Meuse River Variables |
| Variables |
|---|
| sample |
| x |
| y |
| cadmium |
| copper |
| lead |
Transform the data into an sf object:
meuse_sf <- st_as_sf(meuse.all, coords = c("x","y")) |>
st_set_crs(value = 28992)
And make a plot of the lead concentation in the soils.
ggplot(data=meuse_sf)+
geom_sf(mapping=aes(fill=lead, size=lead), shape=21, alpha=0.6)+
scale_fill_continuous(type="viridis", name="ppm")
tmap_mode('view')
tm_shape(meuse_sf) +
tm_dots(col = "lead", palette = "viridis") #try tmap too here
For determining spatial autocorrelation. Semivariograms can be used for both descriptive statistics and predictive statistics.
Semi-variance: how strongly does an observed observation (i.e. lead concentration) vary as a function of distance. The y axes on these are the Euclidean distance values between the points (in this case it is the lead concentration).
leadVarCloud <- variogram(lead~1, meuse_sf, cloud=T)
plot(leadVarCloud, pch=20, cex=1.5, col="plum", alpha=0.1, ylab=expression(Semivariance~(gamma)), xlab="Distance (m)", main = "Lead concentrations (ppm)")
Things that are more correlated in a semi-variogram are the smaller value points
The variogram cloud above (This shows majority of the noise (all the upper translucent points)), but has too much data on it to synthesize, so we average them into the next graph shown.
leadVar <- variogram(lead~1, meuse_sf, cloud=F)
plot(leadVar, pch=20, cex=1.5, col="darkmagenta",
ylab=expression(Semivariance~(gamma)), xlab="Distance (m)", main="Lead concentrations (ppm)")
This graph has way less noise than the cloud. You could say that the lead concentrations in this dataset are autocorlated out to a distance of about 750 m because \(\gamma\) is small at distances less than 750 m before tappering off a bit. We will build models for this soon. These models normalize the semivariogram into something that ranges from -1 to 1. This is called, Moran’s I.
This has really nice probablity properties, which is why we use it.
\(I =\) \(\frac{N}{\Sigma_i \Sigma_jw_{ij}}\) \(\frac{\Sigma_i \Sigma_j w_{ij}(Z_i-\bar{Z})(Z_j - \bar{Z})}{\Sigma_i(Z_i-\bar{Z})^2}\)
\(Z\) is the variable being measured, and \(w\) is a matrix of weights (i.e. the inverse of the distance between points).
This is similar to the boolean yes/no when turning certain functions on or off. Elements that are further away from one another will have a smaller weight. You can transform Moran’s I into a Z score for hypothesis testing. This means that a Z-score of > 1.96 or < -1.96 would mean there is significant spatial autocorrelation with \(\alpha=0.05\)
spdep# distance matrix
d <- dist(st_coordinates(meuse_sf))
# inverse distance matrix
w <- as.matrix(1/d)
# convert a the weights matrix to a weights list object
# so spdep is happy
wList <- mat2listw(w)
# calculate I
moran.test(meuse_sf$lead,wList)
##
## Moran I test under randomisation
##
## data: meuse_sf$lead
## weights: wList
##
## Moran I statistic standard deviate = 10.687, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1009726698 -0.0061349693 0.0001004419
These results indicate the lead data are positive (but very weakly) autocorrelated when we use \(W\) as inverse distances. As we saw previously though, values above 1000 m don’t actually have a lot of influence on one another. The relationship between distance and inverse distance is:
x <- as.vector(as.matrix((dist(st_coordinates((meuse_sf))))))
y <- as.vector(w)
ggplot()+
geom_line(aes(x=x[x>0], y=y[x>0]), color="pink")+
labs(x="Distance (m)", y="Inverse distance (W)")+
lims(x=c(0,1500))+
theme_classic()
Let’s look at Moran’s I when points are within 100 m of each other.
d <- as.matrix(dist(st_coordinates(meuse_sf)))
# make an empty weights matrix
w <- matrix(0,ncol=ncol(d),nrow=nrow(d))
# set some values to 1 using a logical mask of d<100
w[d<100] <- 1
# convert a the weights matrix to a weights list object
# so spdep is happy
wList <- mat2listw(w)
# calculate I
moran.test(meuse_sf$lead,wList)
##
## Moran I test under randomisation
##
## data: meuse_sf$lead
## weights: wList
##
## Moran I statistic standard deviate = 9.4272, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.784497747 -0.006134969 0.007033667
Note: The Moran’s I at smaller distances is quite high…indicating high autocorrelation for lead concentrations. Let’s try more distances. Let’s look at only points between 500 and 1000 m of one another.
# make an empty weights matrix of the right dimensions
w <- matrix(0,ncol=ncol(d),nrow=nrow(d))
# set some values to 1
w[d>500 & d<=1000] <- 1
# convert a the weights matrix to a weights list object
# so spdep is happy
wList <- mat2listw(w)
# calculate I
moran.test(meuse_sf$lead,wList)
##
## Moran I test under randomisation
##
## data: meuse_sf$lead
## weights: wList
##
## Moran I statistic standard deviate = -6.6794, p-value = 1
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.1077445658 -0.0061349693 0.0002314141
This shows Moran’s I as a correlogram by distances 0 to 1500 m in 200 m bins:
distanceInterval <- 200
distanceVector <- seq(0,2000,by=distanceInterval)
n <- length(distanceVector)
d <- as.matrix(dist(st_coordinates(meuse_sf)))
# make an object to hold results
res <- data.frame(midBin=rep(NA,n-1),I=rep(NA,n-1))
for(i in 2:n){
w <- matrix(0,ncol=ncol(d),nrow=nrow(d))
# set some values to 1
w[d >= distanceVector[i-1] & d < distanceVector[i]] <- 1
# convert a the weights matrix to a weights list object
# so spdep is happy
wList <- mat2listw(w)
# calculate I
res$I[i-1] <- moran.test(meuse_sf$lead,wList,zero.policy=TRUE)$estimate[1]
# centered distance bin
res$midBin[i-1] <- distanceVector[i] - distanceInterval/2
}
ggplot(data=res, mapping = aes(x=midBin,y=I)) +
geom_hline(yintercept = 0, linetype="dashed") +
geom_line() + geom_point(size=3) +
labs(x="Distance (m)",y="Moran's I", title = "Moran's I as function of distance")
Above, you can see that the values of lead are autocorrelated for a few hundred m and then falls off. This is conceptually similar to Ripley’s K.
Often, and especially with areal data, we calculate the weights matrix using just the correlation between the \(k\) closest neighbors and usually set \(k=8\) (there is nothing magical about eight – it is a loose kind of rule).
w <- knn2nb(knearneigh(meuse_sf, k=8)) #knearneigh tells you the nearest neighbors for each point and then turns the object into an nb with knn2nb.
moran.test(meuse_sf$lead, nb2listw(w)) #the nb2listw function takes in an nb object and makes it useable in the moran.test function.
##
## Moran I test under randomisation
##
## data: meuse_sf$lead
## weights: nb2listw(w)
##
## Moran I statistic standard deviate = 11.512, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.406297811 -0.006134969 0.001283454
But what do we see here? If use the eight nearest neighbors to calculate spatial autocorrelation the value of Moran’s I jumps way up. Near is like near when it comes to lead. Kind of cool. If we looked at 16 neighbors we’d see Moran’s I of 0.28 which is still pretty high. It we looked at 32 neighbors? Moran’s I of 0.123. So, more neighbors, greater distances we get declining values. See where I’m going with this?
n <- 7
res <- data.frame(k=2^(1:n),I=rep(NA,n))
for(i in 1:n){
w <- knn2nb(knearneigh(meuse_sf,k=2^i))
res$I[i] <- moran.test(meuse_sf$lead,nb2listw(w))$estimate[1]
}
ggplot(data=res, mapping = aes(x=k,y=I)) +
geom_hline(yintercept = 0, linetype="dashed") +
geom_line() + geom_point(size=3) +
labs(x="K neighbors",y="Moran's I")
That’s a Moran’s I a correlogram by the number of neighbors. Because
these are points, I prefer to look at Moran’s I in a correlogram by
distance – like we did above but if we had to deal with polygonal data,
we’d likely use this approach. See text for more info.
ncfThe spdep library is fantastic but doesn’t produce the
correlogram in the way we often see it with Moran’s I by distance \((I=f(d))\). The ncf library
does this out of the box.
There are two functions we will look at to do this. One uses distance categories and one makes a continuous function.
meuseX <- st_coordinates(meuse_sf)[,1]
meuseY <- st_coordinates(meuse_sf)[,2]
meuseLead <- meuse_sf$lead
First, the discrete plot of Moran’s I by distance categories.
leadI <- correlog(x=meuseX, y=meuseY, z=meuseLead,
increment=100, resamp=100, quiet=TRUE)
plot(leadI,xlim=c(0,1500))
abline(h=0,lty="dashed")
We still see the same result we did above – positive autocorrelation out to distances of about 500 m, very slight negative autocorrelation around 1000 m, and CSR otherwise.
The above graph calculates \(I=f(d)\) using 100 m increments. There is also a significance test component with 100 permutations of the data to test the null hypothesis that the data above follows CSR. Colored dots above suggest significance (not CSR at those points), and hollow dots suggest insignificant. The test uses a two-sided 5% level and this code is way easier than the for-loop previously used.
Oh, one more thing. Remember how the point pattern functions in spatstat were very careful about not letting you run into edge effects? And variogram is that way too. The correlog function is not so careful! It’s more libertarian its approach.
Here is the whole data set. in the default plot.
plot(leadI)
abline(h=0, lty="dashed")
Here you can see that you get non-significant values with high values of I (a bunch of nonsense). We need to find ways to fix the edge effects.
YOU WANT TO LOOK AT DISTANCES LESS THAN \(\frac{1}3\) YOUR MAX SPAN BETWEEN POINTS
Let’s look at the extent of the distances in the data. The coordinates span about 2800 m by 3900 m:
max(meuseX)-min(meuseX)
## [1] 2785
max(meuseY)-min(meuseY)
## [1] 3897
meuseD <- dist(cbind(meuseX, meuseY))
max(meuseD)
## [1] 4440.764
The furthest point to point distance is about 4400 m. So we will decide on distances using this: \(\frac{4400}3 = 1500\) \(m\). Beyond that we actually run into a pairwise comparison and get an edge effect.
Oh, one more thing for the ggplot crowd. Note that all the data for
the correlogram is stored in an easy to access fashion in the
leadI object. You can roll your own plot in you want in
ggplot. And unlike the default plot above, we can control the alpha
level for plotting the significance from the permutation test. E.g.,
data.frame(n=leadI$n,
I = leadI$correlation,
d = leadI$mean.of.class,
p = leadI$p) %>%
mutate(Significant = p < .01) %>%
ggplot() +
geom_hline(yintercept = 0,linetype="dashed") +
geom_path(aes(x = d, y = I,group = 1, color=Significant),size=1) +
geom_point(aes(x = d, y = I, fill=Significant),size=5,pch=21) +
lims(x=c(0,1500),y=c(-0.6,0.6)) +
scale_fill_manual(values = c("steelblue","plum")) +
scale_color_manual(values = c("steelblue","plum")) +
labs(x="Distance (m)",y="Moran's I",
title = "Autocorrelation of Lead",subtitle = "Crit value of p<0.01")
The plot above is discrete. It is showing autocorrelation in distance categories. We can also calculate a continuous function:
leadI <- spline.correlog(x=meuseX, y=meuseY, z=meuse_sf$lead,
resamp=100, xmax=1500, quiet=TRUE)
plot(leadI)
This gives the same picture. You can read up on these functions to learn the subtleties. I’ll unpack this in greater detail in the lecture.