Once again, we will be studying the following law of geography:
“Everything is related to everything else, but near things are more related than distant things” (Tobler (1970))
Spatial vs random features, source Gimond (2017)
The goal of today’s lecture is to objectively quantify the degree to which similar features cluster and where such clustering occurs (Gimond (2017)).
Loan Interest Distribution, do you see any patterns?
Mathematically speaking we can quantify the degree to which similar (or dissimilar) counties or places are clustered.
The Moran’s I statistic is the correlation coefficient for the relationship between a variable (like income) and its surrounding values.
We spent so much time talking about neighbors because we will need to define what the surrounding values are.
The Moran’s I coefficient between an indicator and the value of its neighbors (lag) is nothing more than the slope of the least squares regression line that best fits the points after having equalized the spread between both sets of data.
\(lag_y\) is the mean value of the neighbors of \(y\)
In terms of slopes (Gimond (2017))
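To make the slope interpretation concrete, here is a minimal base R sketch on a hypothetical 4 x 4 grid (toy data, not the tract data used in this lecture). It assumes rook contiguity and row-standardized weights, so that the lag of each cell is the mean of its neighbors' values, and it checks that Moran's I computed from its definition matches the slope of the regression of the lag on the variable. All object names are illustrative.

```r
# Toy example: a 4 x 4 grid of cells (hypothetical data).
set.seed(1)
side <- 4
n <- side^2
coords <- expand.grid(row = 1:side, col = 1:side)

# Binary rook contiguity: two cells are neighbors if they share an edge,
# i.e. their Manhattan distance is exactly 1.
d <- as.matrix(dist(coords, method = "manhattan"))
W <- (d == 1) * 1

# Row-standardize so each row sums to 1; W %*% y is then the neighbor mean.
W <- W / rowSums(W)

# A variable with a trend across rows, so nearby cells look alike.
y <- coords$row + rnorm(n)
lag_y <- as.vector(W %*% y)

# Moran's I from its definition.
z <- y - mean(y)
S0 <- sum(W)
I <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)

# Slope of the least squares regression of the lag on the variable.
slope <- unname(coef(lm(lag_y ~ y))[2])

c(moran_I = I, slope = slope)
```

The two numbers agree up to floating point error because the weights are row-standardized and every cell has at least one neighbor; with other weighting schemes the slope and I can differ by a constant factor.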
In terms of spatial weights we have
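One standard way to write the statistic in terms of spatial weights is given below. The notation is an assumption introduced here for illustration, not taken from the lecture: \(w_{ij}\) is the weight linking places \(i\) and \(j\), \(\bar{y}\) is the mean of \(y\), and \(S_0\) is the sum of all weights.

\[ I = \frac{n}{S_0} \, \frac{\sum_{i}\sum_{j} w_{ij}(y_i-\bar{y})(y_j-\bar{y})}{\sum_{i}(y_i-\bar{y})^2}, \qquad S_0 = \sum_{i}\sum_{j} w_{ij} \]

With row-standardized weights and at least one neighbor for every place, \(S_0 = n\), so the leading ratio drops out and \(I\) coincides with the regression slope of \(lag_y\) on \(y\).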
We will rely on the following data:
# Data: 2014-2018 ACS estimates at the tract level in New York State
a <- read.csv("G:\\My Drive\\Phudcfily\\Syllabus PHUDCFILY\\GitHub PHUDCFILY\\SSEM\\Spatially constrained multivariate clustering\\Materials\\concentrated_disadvantage_tract.csv")
Then we join the data as before.
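require(plyr)
library(tigris)   # tracts(), geo_join()
library(stringr)  # str_pad()
library(spdep)    # lag.listw(), moran.test(), sp.correlogram()
options(tigris_use_cache = TRUE)
# Get the tract polygons for New York State
trc <- tracts("NY", class = "sp")
## Warning in proj4string(obj): CRS object has comment, which is lost in output
# Restore leading zeros so tract codes match the 6-character TRACTCE key
a$tract <- str_pad(a$tract, 6, pad = "0")
trc <- geo_join(trc, a, by_sp = "TRACTCE", by_df = "tract", how = "left")
head(trc)
## class : SpatialPolygonsDataFrame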
## features : 6
## extent : -73.87468, -73.78314, 40.70717, 40.74916 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=NAD83 +no_defs
## variables : 21
## names : STATEFP, COUNTYFP, TRACTCE, GEOID, NAME, NAMELSAD, MTFCC, FUNCSTAT, ALAND, AWATER, INTPTLAT, INTPTLON, tract, pct_single_mother, pct_african_american, ...
## min values : 36, 081, 044800, 36081044800, 448, Census Tract 448, G5020, S, 129758, 0, +40.7098547, -073.7869958, 044800, 0.00770712909441233, 0, ...
## max values : 36, 081, 046500, 36081046500, 465, Census Tract 465, G5020, S, 249611, 0, +40.7469665, -073.8710900, 046500, 0.260406582768635, 0.128636622932116, ...
trc<-trc[!is.na(trc$pct10kbelow)&!is.na(trc$pct_single_mother),]
trc<-trc[!is.na(trc$pct_unemployed_looking),]
summary(trc)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x -79.76214 -71.84771
## y 40.49117 45.01586
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=NAD83 +no_defs]
## Data attributes:
## STATEFP COUNTYFP TRACTCE GEOID
## Length:3928 Length:3928 Length:3928 Length:3928
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## NAME NAMELSAD MTFCC FUNCSTAT
## Length:3928 Length:3928 Length:3928 Length:3928
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ALAND AWATER INTPTLAT INTPTLON
## Length:3928 Length:3928 Length:3928 Length:3928
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## tract pct_single_mother pct_african_american
## Length:3928 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:0.08003 1st Qu.:0.01241
## Mode :character Median :0.12036 Median :0.04430
## Mean :0.14630 Mean :0.17320
## 3rd Qu.:0.18057 3rd Qu.:0.20586
## Max. :0.61111 Max. :0.98843
##
## pct_unemployed_looking pct_family5children zip pct_eitc
## Min. :0.00000 Min. :0.00000 Min. : 1002 Min. :0.00000
## 1st Qu.:0.03468 1st Qu.:0.03796 1st Qu.: 3108 1st Qu.:0.07895
## Median :0.05360 Median :0.06421 Median :10010 Median :0.12601
## Mean :0.06110 Mean :0.07715 Mean : 7992 Mean :0.15607
## 3rd Qu.:0.07893 3rd Qu.:0.09838 3rd Qu.:11373 3rd Qu.:0.22530
## Max. :0.42857 Max. :0.71383 Max. :55308 Max. :0.46271
## NA's :3
## dependent_density pct10kbelow
## Min. :0.02027 Min. :0.1081
## 1st Qu.:0.44670 1st Qu.:0.2578
## Median :0.52261 Median :0.2995
## Mean :0.53420 Mean :0.3195
## 3rd Qu.:0.61470 3rd Qu.:0.3746
## Max. :2.00316 Max. :0.6452
##
## Neighbour list object:
## Number of regions: 3928
## Number of nonzero links: 20874
## Percentage nonzero weights: 0.1352891
## Average number of links: 5.314155
## 34 regions with no links:
## 299 301 526 606 642 706 793 853 869 1283 1287 1444 1450 1457 1479 1497 1723 1796 1956 2037 3119 3404 3870 3905 3909 4063 4101 4182 4432 4449 4477 4796 4873 4878
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 26 29
## 34 92 201 384 609 795 750 586 276 107 55 23 6 4 1 1 1 1 1 1
## 92 least connected regions:
## 22 50 99 121 335 338 350 355 482 506 537 560 598 611 612 659 661 686 754 791 827 896 977 980 981 1009 1016 1087 1126 1194 1228 1365 1464 1481 1641 1671 1691 1692 1719 1788 1839 1873 2245 2285 2287 2290 2334 2385 2450 2453 2500 2507 2533 2536 2621 2783 2788 2891 3051 3092 3159 3176 3267 3466 3480 3504 3515 3555 3604 3663 3669 3691 3826 3885 3904 3910 3939 3989 4013 4099 4108 4164 4183 4187 4258 4307 4318 4368 4472 4510 4551 4787 with 1 link
## 1 most connected region:
## 1555 with 29 links
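The headline figures in the neighbour summary can be cross-checked by hand. The sketch below uses only numbers copied from the output above (no spatial packages are needed), and the variable names are illustrative. It recomputes the average number of links and the percentage of nonzero weights, and confirms that the link number distribution adds up to the reported totals.

```r
# Values copied from the neighbour list summary above.
n_regions <- 3928
n_links   <- 20874

# Average number of neighbours per tract.
avg_links <- n_links / n_regions

# Share (in percent) of nonzero cells in the n x n weights matrix.
pct_nonzero <- 100 * n_links / n_regions^2

# Link number distribution: k links (first row) and number of tracts (second row).
k     <- c(0:17, 26, 29)
count <- c(34, 92, 201, 384, 609, 795, 750, 586, 276, 107,
           55, 23, 6, 4, 1, 1, 1, 1, 1, 1)

# The counts should cover every tract, and the links should add up.
total_tracts <- sum(count)
total_links  <- sum(k * count)

c(avg_links = avg_links, pct_nonzero = pct_nonzero)
```

The 34 tracts with zero links have no neighbors to average over, which is presumably why the lagged values that follow contain NAs.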
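trc@data$lag.pct10kbelow <- lag.listw(ny.listw, trc@data$pct10kbelow)  # call recovered from the warning below
## Warning in lag.listw(ny.listw, trc@data$pct10kbelow): NAs in lagged values
M1 <- lm(trc@data$lag.pct10kbelow ~ trc@data$pct10kbelow)
plot(trc@data$lag.pct10kbelow ~ trc@data$pct10kbelow, pch = 20, asp = 1, las = 1)
abline(M1, col = "blue")
moran.test(trc$pct10kbelow, ny.listw, zero.policy = TRUE)  # call inferred from the output header below
##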
## Moran I test under randomisation
##
## data: trc$pct10kbelow
## weights: ny.listw n reduced by no-neighbour observations
##
##
## Moran I statistic standard deviate = 32.164, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.3360493119 -0.0002568713 0.0001093270
The expectation of Moran’s \(I\), that is, the value of \(I\) under a spatially random process, is \(E[I] = \frac{-1}{n-1}\).
The standard deviate is \(\frac{I-E[I]}{\sqrt{Var(I)}}\).
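As a quick check, the expectation and the standard deviate can be recomputed from the numbers printed in the test output. This is a sketch under one assumption: that \(n\) excludes the 34 tracts with no links, as the note "n reduced by no-neighbour observations" suggests. Variable names are illustrative.

```r
# Values copied from the moran.test output and the neighbour summary above.
I_obs  <- 0.3360493119
var_I  <- 0.0001093270
n_all  <- 3928   # tracts in the neighbour list
n_zero <- 34     # tracts with no links

# Assumption: n is reduced by the no-neighbour tracts.
n <- n_all - n_zero

# Expectation of I under spatial randomness.
expected_I <- -1 / (n - 1)

# Standard deviate: distance of the observed I from its expectation,
# measured in standard deviations.
z <- (I_obs - expected_I) / sqrt(var_I)

c(expected_I = expected_I, z = z)
```

Both values should agree with the reported expectation and standard deviate up to the rounding of the printed variance.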
How bad is the dependence problem?
# Examine Moran's I between observations and their lags over several orders.
plot(sp.correlogram(ny.listw$neighbours, trc$pct10kbelow, order = 20, method = "I", zero.policy = TRUE), xlab = "Spatial lags", main = "Spatial correlogram: Autocorrelation CIs")
Anselin, Luc. 2020. Spatial Data Science. University of Chicago Center for Spatial Data Science. https://spatialanalysis.github.io/tutorials/.
Gimond, Manuel. 2017. “Intro to GIS and Spatial Analysis.” https://mgimond.github.io/Spatial/index.html.
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (sup1): 234–40. http://www.geog.ucsb.edu/~tobler/publications/pdf_docs/A-Computer-Movie.pdf.