$\color{red}{\text{Notes for Group 6}}$

0.1 Group and Final Project Deadlines

Reports Final Drafts: $\color{red}{\text{May 16, 2021 Sunday 11:00pm EST}}$
Report Merge: May 17, 2021 Monday EST
Youtube Recording Prepared and Uploaded: May 19, 2021 Wed
Project Due Date: $\color{red}{\text{Submitted by May 20, 2021 Thursday 11:59pm EST}}$

Please don’t run tuning code or calibration code in the final version of your result. Just tune the model and give the final working version for inclusion. You can either comment out or add an eval=F flag to the relevant code chunk for such code.

0.2 Components

The raw data files are used to generate the cleaned data files. The cleaned data files are: crimesdiv.geojson and div.geojson. You will be able to use them for your data science work and ignore the 4 other raw files.

Exploratory Data Analysis $\color{red}{\text{Randy}}$

Visualizations are needed

Overlay map of census tracts and Town of Cary boundary limits with background tile from real map
Map of weighted average income by residential subdivision with annotations/title.
Map of total number of crimes by residential subdivision with annotations/title.

The eBook Geocomputation with R has a lot of good examples using tmap working with sf. (https://geocompr.robinlovelace.net/adv-map.html#interactive-maps)

Bar chart - Total Crime count by month from Jan 2018 - Nov 2020
Heat map - Crime Time by month from Jan 2018 - Nov 2020

Examples of traditional charts and heatmaps specific to crime data (https://rpubs.com/Rjacks14/633448)

Small multiples of crime maps by subdivision showing different crime categories.
If possible (optional) Interactive map showing income and crime by subdivision in shiny or leaflet using tmap.

Unsupervised Analysis $\color{red}{\text{Phil}}$

k-means clustering analysis to identify the centroid of each region. I suggest mapping each region to its equivalent centroid.

Interesting use of k-means clustering along with data visualization. But the distance metric is badly implemented and I wouldn’t follow this metric but the use of only 2 clusters is interesting.

(https://towardsdatascience.com/exploring-clustering-and-mapping-torontos-crimes-96336efe490f)

I suggest working hard to define a distance metric that includes:

the type of crime but may bucket the ucr codes into a small number of categories.
median household income
try using 2 (high vs low) and (4 districts) and more as needed.
i.e plain Euclidean distance of the actual points does not seem great.

Cluster the data in 2018 alone. Re-run on 2019 data and see if the cluster centroids are the still the same. Plot both diagrams. Color code the residential subdivisions in a map showing the k-means.

The benefit of k-means is not having to explain statistical significance or why you did not use spatial correlations.

Principal Components Analysis $\color{red}{\text{Alex}}$

Build a matrix of size 713 X 30 of subdivisions and attributes like crime_type and income. Run a PCA and interprete the first and second components a la UC-R article for US Arrests. (https://uc-r.github.io/pca) Will focus on using 2018 data.

Supervised learning: Spatial Regression Model $\color{red}{\text{Alex & TBD}}$

Not sure if we can finish this component by target date. The challenge is that regression models require handling spatial auto-correlation. A lot to enjoy!

Rspatial website has a good treatment spatial analysis and the supporting software. (https://rspatial.org/raster/analysis/1-introduction.html)

Bivand, Pebesma, Gomez-Rubio Applied Spatial Data Analysis with R (http://gis.humboldt.edu/OLM/r/Spatial%20Analysis%20With%20R.pdf)

Lansley, Cheshire An Introduction to Spatial Data Analysis and Visualisation in R (https://spatialanalysisonline.com/An%20Introduction%20to%20Spatial%20Data%20Analysis%20in%20R.pdf)

Introduction

This document analyzes municipal crime data for the Town of Cary, North Carolina using machine learning methods. In particular, we examine patterns in crime incidents over crime category and residential subdivision. We use both unsupervised machine learning methods to identify clusters and patterns of crime activity and supervised methods to predict patterns of activities.

You may wish to turn off the code in section 1

Section 1 describes the background and data sources.

Section 2 conducts data wrangling and interpolate data to obtain the dataset for our analysis. We construct and describe datasets in later sections.

Section 3 applies a k-means clustering algorithm to identify crime hot spots

Section 4 examines principal component analysis of the crime data

Section 5 examines spatial regression models to predict crime rates using the 2018 training and 2019 test split.

Section 6 concludes our remarks.

Section 7 presents our R code and technical appendices and references.

1 Background and Data Sources

1.1 Background

Cities and towns have good and bad neighborhoods. One common measure of neighborhood health is the level of crime. Another is household income. We wanted to investigate crime and income inequality by neighborhood in a mid-sized town from a statistical learning approach. Such towns have been less studied than larger metropolitan areas such as San Francisco, Chicago or New York. Such an investigation needs detailed crime and income data and a suitable definition of neighborhoods. It also requires the application of statistical learning methods to geospatial data along with the above mentioned attributes.

We investigate the Town of Cary in Wake County, North Carolina. This is a relatively prosperous middle sized boom town located in the Research Triangle area of North Carolina where technology and health companies are concentrated. The Town of Cary has an open data initiative. On the Town’s website, the Police Department publishes crime incident data compiled by its Police Department, residential subdivision boundary data and town boundaries. Lastly, recent US Census Bureau American Community Survey (ACS) data on household income is available to measure economic affluence.

$\color{red}{\text{More text to explain structure of the paper goes here.}}$

1.2 Data Sources

The data sources consists of 4 data files containing typical numerical and qualitative data and geospatial data.

Town of Cary Boundary File: cary_boundary.geojson which contains the geographic border information of the town.
Census Income Data: We used the American Community Survey (ACS) dataset for 2018 median household income on census tracts overlapping with the Town of Cary.
Residential Subdivisions: We use the file: httpmapstownofcary0.geojson obtained from the Town of Cary Open data portal to identify approximately 700 residential subdivisions within the Town. These correspond to neighborhoods of the Town.
Crime Incident Data from the Cary Police Department: We use the historical crime from the Town of Cary Police Department as of Nov 30, 2020. This file contains all historical crime incidents with geospatial data, date, crime category information. We will use the crime incident data from 2018 to 2020.

First we load the relatively small files containing town and residential subdivision and household income data.

Then we load the crime incident data. The raw dataset is quite large and requires some effort to import.

After importing all the raw files, we can examine their dimensions and contents below:

Raw data file dimensions
	town_limits	acs_income	subdivisions	crimes
number rows	1	47	717	101405
number columns	1	5	8	37
source	Town of Cary	Censusreporter.org	Town of Cary	Town of Cary Police Department

1.3 Data Issues and How We Address Them

To conduct this analysis, we had to address both the usual missing or bad data issues but also geospatial and methodological issues. We review each data source and discuss both types of issues jointly.

1.3.1 Town Boundaries

The Town of Cary straddles two counties and a significant part of the Town is zoned for commercial or business purposes. However, we determined that the residential subdivisions lie inside the Town thus avoiding the complex issue where subdivisions are split across two municipalities. This dataset provided in GeoJson format is mainly a control during exploratory data analysis. We rely on the residential subdivisions rather than the town boundary datafile to conduct our statistical learning analyses.

1.3.2 Residential Subdivisions

Residents in Cary typically live in residential subdivisions. These subdivisions are overseen by homeowner associations (HOA) but are not municipal administrative units. Such subdivisions are often built on farmland or previously vacant land lacking basic utilities like water, sewage or power prior to the development. Rather residential subdivisions are planned communities constructed by real estate development firms to be sold to homeowners.

We found and removed 4 subdivisions out of 717 where geometry data is missing. We added an identity column div_id to serve as a primary key because subdivision names did not appear to be unique.

## [1] "raw subdivisions data: 717 vs. tidied subdivisions data: 713"

1.3.3 Census Tract Level Household Income

Next, we obtained all census tracts covering the Town of Cary limits. It was necessary to construct a dataset with 47 census tracts to cover the Town of Cary limits. However, census tracts don’t generally align with residential subdivisions. Therefore, multiple census tracts may partition a residential subdivision. The data was obtained from the non-profit censusreporter.org website as a Geojson file. There was no missing data but the median household income data element was renamed from the obscure B19013001 to income for clarity. The data comes from the 2018 ACS dataset. We display some representative rows below while simplifying the dataset further.

Lastly, the table below shows the median household income in dollars within each census tract based on the ACS 2018 Survey.

geoid	tract_name	income	geometry
14000US37037020701	Census Tract 207.01, Chatham, NC	82833	MULTIPOLYGON (((1991620 760…
14000US37037020702	Census Tract 207.02, Chatham, NC	70938	MULTIPOLYGON (((1959672 696…
14000US37183053003	Census Tract 530.03, Wake, NC	68050	MULTIPOLYGON (((2070867 719…
14000US37183053004	Census Tract 530.04, Wake, NC	102143	MULTIPOLYGON (((2065252 723…
14000US37183053005	Census Tract 530.05, Wake, NC	155673	MULTIPOLYGON (((2060975 711…

Because our goal is to estimate median household income of a residential subdivision using census tract data, the non-alignment of these two geographical datasets poses a challenge. We estimate the median household income $f(S)$ for a residential subdivision $S$ by taking a weighted average of the median household income of all census tracts $c$ that intersect a given residential subdivision. More formally, we define:

\[ f(S) = \sum_{ c \cap S \neq \emptyset } \frac{\mu(c \cap S) }{\mu(S)} \text{Income}(c) \] where $\mu(A)$ is the area of a region $A$.

To construct this weighted average requires geospatial intersections and dataframe joins which we do using the sf geospatial library and tidyverse for typical dataframe operations.

We observe that about 6% of all subdivisions 43 out of 713 belong to more than one census tract. Of those, only 4 belong to 3 census tracts. Moreover, the median household income range across the census tracts is not extreme.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   48314   82946  109628  104216  123613  169600       1

The lowest income is $48,314 while the highest is $169,600 with a median of $109,628.

## [1] 756  12

We compute the weighted average income below and then save its value back to the residential subdivision dataset. Sample rows are illustrated below.

div_id	category	name	shape_stlength	shape_starea	approvedlots	description	geo_point_2d	avg_income	geometry
1	Single-Family Detached	Southwick	4109.284	801167.3	49	Single-Family Detached	35.77878, -78.84472	131875.00	MULTIPOLYGON (((2045503 737…
2	Single-Family Detached	Forest Creek	7206.395	1847059.2	92	Single-Family Detached	35.71299, -78.78806	155673.00	MULTIPOLYGON (((2063243 713…
3	Single-Family Detached	Stonehaven	5661.140	805679.6	52	Single-Family Detached	35.75736, -78.76810	95867.19	MULTIPOLYGON (((2067795 730…

1.3.4 Crime Incidents

Crime incident data is a large historical dataset. The raw data file, obtained in Dec 2020, contains crime incident data from 1977 to 2020. We have decided to use only the most recent years (2018-2020).

This time limitation still produces a large number of observations (over 13000) but avoids two problems. One is the data quality issues arising from inconsistent reporting especially during its early years. The second, equally important, is that the Town has grown rapidly from the early 2000s. New residential subdivisions have sprouted up over the last two decades. Thus, the absence of crime in a residential subdivision could mean that it was not yet built. Unfortunately, we don’t have historical data on the creation of residential subdivisions. So we are restricted to doing a point-in-time study.

We also filter the incidents by data quality. A significant number of crime incidents lack geometry data which means we cannot find the location within a subdivision. We purge those observations but they are less than half. In recent years, we find geospatial data to be quite completely.

Lastly, we restrict the crime dataset to those incidents occuring in the residential zones of the town. Roughly forty percent of crime appears to occur in commercial, business or non-residential zones. These are incidents within the town limits but not within any residential subdivision. These are important for quality of life issues within the town but less important for differentiating between residential subdivisions.

## [1] 13732    12

	incident_number	newdate	ucr	crime_type	crime_category	geocode	district	geometry
1	18001463	2018-02-17	9914	ALL OTHER - ALL TRAFFIC EXCEPT DWI (NON-UCR)	ALL OTHER	HAMPTON WOODS LN	D1	POINT (2074540 743896.3)
3000	20001132	2020-02-03	23C	LARCENY - SHOPLIFTING	LARCENY	CROSSROADS BLVD	D3	POINT (2077415 731775)
4000	20002505	2020-03-17	90Z	ORDINANCE - ANIMAL BITE	ALL OTHER	CARPENTER FIRE STATION RD	D2	POINT (2030835 754235)
6000	18000128	2018-01-04	220	BURGLARY - FORCIBLE ENTRY	BURGLARY	PETTY FARM RD	D2	POINT (2036305 758545)

In the data table above, we illustrate some representative rows and columns from the crime data table. We briefly comment on the columns:

The incident_number serves as the unique key in the crime dataset.
The crime is recorded on newdate.
The ucr code follows the FBI Uniform Crime Reporting Program standards in classifying incidents but also uses a small number of non-standard codes which appear to be unique to the Town.
The crime_type is the most granular but has a many-to-one relationship to the ucr code.
The district represents one of the four council districts which partitions the Town.

Finally, we inner join the crime incidents data to the residential subdivision and median household income data previously discussed to obtain our cleansed dataset crimes_div. We can see below that the total number of crimes in 2018 , 2019 and a partial 2020 are quite stable below.

## [1] 6239   21

Finally, we export the enriched crimes data and residential subdivision dataset for re-use later.

1.3.5 Coordinate Transformations

We change the coordinate reference system from a geometric crs to a projected CRS on all datasets. We see EPSG=2264 which is the NAD83/North Carolina coordinate system commonly used by the state agencies. The transition from a geographic CRS to a projected CRS means distances and areas are easier to measure. EPSGO.io (https://epsg.io/2264-15835) states the projection accuracy is within 1.0 meter on a bounding box around the entire state.

1.3.6 Cleansed Crime Data

In this final section, we reload the cleansed data files representing the two dataframes for our modeling work. We load crimesdiv which contains the crime incident data enriched with geographical location, associated residential subdivision and weighted average median household income. We also load the dataset of residential subdivisions div.

## Reading layer `crimesdiv' from data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/622_MACHINE_LEARNING_2021_SPRING/FINAL_PROJECT/work/crimesdiv.geojson' using driver `GeoJSON'
## Simple feature collection with 6239 features and 19 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 2017383 ymin: 692785 xmax: 2080146 ymax: 769015
## Projected CRS: NAD83 / North Carolina (ftUS)

## Reading layer `div' from data source `/Volumes/GDRIVE_SSD/homes/alex/datascience/622_MACHINE_LEARNING_2021_SPRING/FINAL_PROJECT/work/div.geojson' using driver `GeoJSON'
## Simple feature collection with 713 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 2016873 ymin: 691265.6 xmax: 2080271 ymax: 769129.4
## Projected CRS: NAD83 / North Carolina (ftUS)

2 Visualization and Statistical Summary

3 k-Means Clustering

4 Principal Components Analysis

5 Spatial Regression

The literature on criminal prediction in machine learning describes several ways to apply municipal crime report data. Kounadi, Ristea, et al.(2020) One approach predicts crime hot spots in real-time for various types of crimes like burglaries, assaults. Other approaches look at spatial-temporal patterns across an entire city.

In this section, we adopt the approach of predicting aggregate crime levels within the residential subdivisions of the town over a single year from static characteristics and past crime levels. There are two reasons to restrict regression analyses in this manner:

the town has expanded dramatically in the last decade - so many neighborhoods did not exist a decade ago. But, the data to track town expansion is readily available.
this investigation’s focus is on the role of the residential subdivision and the role of its spatial dependencies on crime levels.

This section is organized as follows. First, we aggregate crime incidents by calendar year and residential subdivisions from the previously constructed incident dataset. Then, we check for spatial dependencies in the data and then estimate a OLS regression model and a spatial lag model. Lastly, we interpret the spatial model coefficients and compare them to the OLS coefficients.

5.1 Data Aggregation By Residential Subdivision

We describe data correction and data aggregation steps.

To construct our regression data set, we needed to fix missing values identified in four observations. Three subdivisions Arrington Woods, Summercrest Two and Searstone lacked values for the feature approvedlots. We estimate a simple linear regression to impute the approvedlots from land use shape_starea for each subdivision categories which are Condo, Single Family Detached, and Mixed Residential. In Cary, Single Family Detached subdivisions use on average 12244 square feet (or 0.3 acres) per lot while Condo and Mixed Residential uses much less land per lot since they are often apartment buildings. The intuitively plausible estimates were cross checked by Internet search of each subdivision. We show the values in the table below.

## 
## Call:
## lm(formula = div$shape_starea ~ div$approvedlots)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4433751  -373566  -119874   318529 19801364 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      218299.4    76811.9   2.842  0.00461 ** 
## div$approvedlots  12244.8      525.4  23.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1535000 on 711 degrees of freedom
## Multiple R-squared:  0.4331, Adjusted R-squared:  0.4323 
## F-statistic: 543.2 on 1 and 711 DF,  p-value: < 2.2e-16

Residential Subdivisions with Imputed Approved Lots
name	category	approvedlots
Arrington Woods	Condo	11
Summercrest Two	Single Family Detached	33
Searstone	Mixed Residential	573

One observation Brickyard had a missing category which we manually assigned to Townhome by an Internet search. The frequency table of subdivision categories is shown as follows.

Var1	Freq
Apartment	67
Condo	19
Mixed Residential	4
Mobile Home	2
Nursing Home	4
Retirement Community	10
Single-Family Detached	481
Townhome	126

Our data aggregation process adds a total incident count column 2018, 2019, 2020 for each calendar year to the subdivision dataframe. We illustrate the data aggregation visually by density plot below. The distribution

Lastly, we visualize each year’s crime level geographically below. We can see the clear presence of high crime levels in a small number of subdivisions that persist across the 3 years while most subdivisions are low crime. We also see the presence of some clusters that suggest spatial dependencies.

5.2 Regressions

We construct regression models to predict crime levels 2019 in terms of static variables and 2018 crime levels. We run both a standard OLS model that ignores spatial dependencies and a spatial lag model includes those dependencies in the analysis. The OLS results are reported below.

## 
## Call:
## lm(formula = `2019` ~ `2018` + avg_income + approvedlots + shape_starea + 
##     factor(category), data = regdivdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7876  -1.0930  -0.3515   0.6842  18.8328 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             3.448e+00  5.526e-01   6.240 7.56e-10
## `2018`                                  4.646e-01  2.919e-02  15.915  < 2e-16
## avg_income                             -1.036e-05  3.173e-06  -3.265  0.00115
## approvedlots                            1.097e-02  1.658e-03   6.618 7.22e-11
## shape_starea                            1.844e-08  7.955e-08   0.232  0.81674
## factor(category)Condo                  -1.876e+00  7.238e-01  -2.592  0.00974
## factor(category)Mixed Residential      -1.364e+00  1.367e+00  -0.998  0.31858
## factor(category)Mobile Home             1.301e+01  1.872e+00   6.946 8.59e-12
## factor(category)Nursing Home           -3.353e+00  1.354e+00  -2.476  0.01353
## factor(category)Retirement Community   -2.308e+00  9.134e-01  -2.527  0.01172
## factor(category)Single-Family Detached -2.145e+00  4.682e-01  -4.581 5.47e-06
## factor(category)Townhome               -2.409e+00  4.744e-01  -5.078 4.90e-07
##                                           
## (Intercept)                            ***
## `2018`                                 ***
## avg_income                             ** 
## approvedlots                           ***
## shape_starea                              
## factor(category)Condo                  ** 
## factor(category)Mixed Residential         
## factor(category)Mobile Home            ***
## factor(category)Nursing Home           *  
## factor(category)Retirement Community   *  
## factor(category)Single-Family Detached ***
## factor(category)Townhome               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.596 on 701 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.6715 
## F-statistic: 133.3 on 11 and 701 DF,  p-value: < 2.2e-16

To run a spatial regression, we first have to build a spatial weight matrix based on the dependencies between subdivisions. We have a choice of contiguity by region or distanced based locations. We ruled out two standard methods after experimentation.

Regional contiguity left too many isolated subdivisions (over 90) which was unacceptable because these subdivisions would have zero contribution to spatial dependencies.
Distance banded weight matrix - required a very large distance threshold to ensure no isolated subdivisions.

We settled on a k-nearest neighbor spatial dependency matrix for $k=6$. Among the neighbors, we define an inverse distance weighted measure and then row-standardized the weights for each observation (subdivision). In particular, we define the raw weight $rw[i,j]$ of a subdivision $i$ to its neighbor $j$ as:

\[rw[i,j] = \frac{1}{\lVert c(i)-c(j)\rVert^{\alpha}} \text{ where } \alpha = 0.5\] Then we define the row standardized weights $w[i,j]$ in a matrix $W$ as:

\[w[i,j] = \frac{rw[i,j]}{\sum_{h=1}^{N} rw[i,h]} \text{ where } N = \text{ number of subdivisions} \]

Surprisingly, $\alpha=2$ causes the Lagrange multipler spatial dependency tests to fail to be statistically sigificant whereas $\alpha=1$ and $\alpha=0.5$ produces strong results. Because the distance-based weights are not scale independent, (we use feet to measure distances), we attribute this statistical significance effect to the fact the decay rate of the weights. A large $\alpha$ makes the weights decay too fast.

[Applications of Spatial Weights: R Notes by Luc Anselin and Grant Morrison] (https://spatialanalysis.github.io/lab_tutorials/Applications_of_Spatial_Weights.html#spatially-lagged-variables)

We plot the spatial weights matrix below with edges in red to show the connection strength and each dot representing the centroid of a subdivision plotted by its geographic coordinates. We see the presence of some connected components but no isolates.

To identify the type of spatial regression model to run, we use the Lagrange multipler tests. To identify the presence of spatial dependencies, we use Moran’s I-statistic. The Lagrange multiplier test for the spatial weights $W$ and above OLS model show that a spatial lag model is appropriate with p-value of 0.88% significance but a spatial errors model is not with a p-value of 86% significance.

## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = `2019` ~ `2018` + avg_income + approvedlots +
## shape_starea + factor(category), data = regdivdata)
## weights: wts_k6_a2
## 
## LMlag = 6.8475, df = 1, p-value = 0.008876
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = `2019` ~ `2018` + avg_income + approvedlots +
## shape_starea + factor(category), data = regdivdata)
## weights: wts_k6_a2
## 
## LMerr = 0.030031, df = 1, p-value = 0.8624

Moran’s I statistic on the response variable 2019 crime levels shows spatial dependencies exist. Clustering exists and is highly statistically significant. Moran’s I statistic shows a value of 0.17 while p-value is nearly 0.

## 
##  Moran I test under randomisation
## 
## data:  regdivdata$`2019`  
## weights: wts_k6_a2    
## 
## Moran I statistic standard deviate = 8.3061, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.171497419      -0.001404494       0.000433319

We summarize the spatial lags model below.

## 
## Call:lagsarlm(formula = `2019` ~ `2018` + avg_income + approvedlots + 
##     shape_starea + factor(category), data = regdivdata, listw = wts_k6_a2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.06162  -1.10105  -0.31190   0.71921  18.53291 
## 
## Type: lag 
## Coefficients: (numerical Hessian approximate standard errors) 
##                                           Estimate  Std. Error z value
## (Intercept)                             2.9254e+00  5.8551e-01  4.9964
## `2018`                                  4.5637e-01  2.8982e-02 15.7468
## avg_income                             -8.6747e-06  3.2055e-06 -2.7062
## approvedlots                            1.0749e-02  1.6173e-03  6.6467
## shape_starea                            3.1606e-08  7.6713e-08  0.4120
## factor(category)Condo                  -1.7017e+00  7.1793e-01 -2.3704
## factor(category)Mixed Residential      -1.4572e+00  1.3508e+00 -1.0788
## factor(category)Mobile Home             1.2714e+01  1.8509e+00  6.8689
## factor(category)Nursing Home           -3.1718e+00  1.3387e+00 -2.3693
## factor(category)Retirement Community   -2.1735e+00  9.0251e-01 -2.4083
## factor(category)Single-Family Detached -2.0375e+00  4.6317e-01 -4.3990
## factor(category)Townhome               -2.3418e+00  4.6861e-01 -4.9975
##                                         Pr(>|z|)
## (Intercept)                            5.841e-07
## `2018`                                 < 2.2e-16
## avg_income                              0.006806
## approvedlots                           2.998e-11
## shape_starea                            0.680339
## factor(category)Condo                   0.017771
## factor(category)Mixed Residential       0.280661
## factor(category)Mobile Home            6.472e-12
## factor(category)Nursing Home            0.017820
## factor(category)Retirement Community    0.016025
## factor(category)Single-Family Detached 1.088e-05
## factor(category)Townhome               5.808e-07
## 
## Rho: 0.10234, LR test value: 6.1576, p-value: 0.013085
## Approximate (numerical Hessian) standard error: 0.04097
##     z-value: 2.4979, p-value: 0.012492
## Wald statistic: 6.2397, p-value: 0.012492
## 
## Log likelihood: -1682.663 for lag model
## ML residual variance (sigma squared): 6.5567, (sigma: 2.5606)
## Number of observations: 713 
## Number of parameters estimated: 14 
## AIC: 3393.3, (AIC for lm: 3397.5)

5.3 Discussion

The equation defining the model can be written as:

\[y=\rho W y + \mathbf{X}b + u\]

where the response variable $y$ is a $N \times 1$ vector. $\rho$ is the spatial autoregressive coefficient and $W$ is the $N \times N$ spatial weights matrix described earlier. $\mathbf{X}$ is the matrix of subdivision level data either crime in 2018 and static characteristic data. $u$ is the vector of random error terms.

We observe a low but significant spatial dependency between crime levels in adjoining residential subdivisions. The spatial lag regression is statistically significant as $\rho = 0.10$ and its p-value $p=1.31\%$ is significant at the 5% level.

Examining the coefficients, we see the past crime level 2018 is very statistically significant $p < 2.2 \times 10^{-16}$ at expected sign and magnitude. Likewise, the number of approved lots is also very significant $p=3.0 \times 10^{-11}$ while the coefficient is positive and small. Average income is highly significant and negatively related to crime levels as expected with $p=0.68%$. The area in the subdivision is not statistically significant and the coefficient is nearly zero. Lastly, we consider the category of subdivision. category=Mobile Home has a very strong positive association with higher crime levels while all other types of subdivisions are negatively related.

We conclude that a $100,000 increase in median household income (avg_income) would be associated with 0.86 fewer incidents per year after incorporating the past year’s crime level. By comparison with the OLS model, the coefficients are larger. A $100,000 increase in avg_income reduces crime by 1.03 incident per year. This difference is because the OLS model does not control for spatial clustering.

As a sensitivity check, if we exclude past crime levels as a predictor in the spatial lag model, the coefficient for avg_income grows larger. In that case, a $100,000 increase in avg_income reduces crime by 1.7 instead of .86 incidents per year per subdivision.

Interestingly, crime levels appear unrelated to the square footage of the subdivision shape_starea. Perhaps, crime rarely occurs on empty land but requires the inhabitants as perpetrators and victims.

However, crime level in the Town of Cary appear relatively low. The median number of crimes is 1 incident per year per subdivision while the mean is around 3. The relatively modest spatial autoregressive coefficient may be explained by two effects: the low overall level of crime - in some cases - 1 incident per year and the geographic dispersion of the subdivisions. Some subdivisions are geographically isolated from others. Adjacent subdivisions are often separated by landscaping and trees with no connecting paths or roads. Sometimes you cannot walk between subdivisions except along arterial public highways.

6 Conclusion

Spatial regression is a fruitful method to investigate crime incidents in residential subdivisions. A recent literature survey Kounadi, Ristea, et. al, 2020 found 193 papers in spatial crime forecasting. None of the top 32 papers used residential subdivisions as spatial unit of measure. Most papers used geographic grids or broader regions like districts, cities or counties. One challenge of using residential subdivisions is their heterogeneity of purposes and population and geographic dispersion. A benefit of using residential subdivisions is that they are actual communities - often defined by a homeowner association with like-minded residents.

We experimented with non-traditional spatial weight measures to overcome technical challenges to conducting spatial regressions and produce intuitively sensible model coefficients. We conjecture that spatial dependency in the Town is local rather than global.
A future direction of study is to use localized spatial regression analysis to analyze crime in such a sprawling town.

7 Appendices

7.1 References

National Incident-Based Reporting System (NIBRS) codes are described in the 2019 update: (https://www.fbi.gov/file-repository/ucr/ucr-2019-1-nibrs-technical-specification.pdf/view)

Vignettes for the use of sf package in R by Edzer Pebesma: (https://r-spatial.github.io/sf/articles/)

[Crime Prediction & Monitoring Framework Based on Spatial Analysis] (https://www.sciencedirect.com/science/article/pii/S187705091830807X)

[Dash, Safro, Srinivasamurthy 2018] Spatio-temporal prediction of crimes using network analytic approach (https://arxiv.org/abs/1808.06241)

Kounadi, Ristea, et al.(2020). A systematic review on spatial crime forecasting. Crime Science 9:7