Introduction and Data

As Tobler (1970) states,

“[E]verything is related to everything else, but near things are more related than distant things” (p. 234)

This is “the first law of geography”, also known as Tobler’s Law stating that spatial patterns aren’t random. Keeping this in mind, for this project, I answered is there a significant spatial pattern of homicide rates and where? Have homicide rates changed significantly? if not, how have they stayed the same? The data set I’ll be using is provided by the center of Spatial Data Science’s (University of Chicago) sample data sets. It was also used in this paper). The data set includes information about homicides and certain socio-economic characteristics across US counties from 1960-1990’s I imported and extracted the ZIP folder in R, so I can read in this data and explore some of its characteristics.

#Homicides <- st_read("/cloud/project/NAT.SHP")


# will be using Leaflet, which expects the CRS to be 4326
Homicides <- st_read("C:/Users/smang/OneDrive - bates.edu/Spatial Stats with GIS using R/Spatial Stats with GIS using R/NAT.SHP", crs = 4326)

## Reading layer `NAT' from data source 
##   `C:\Users\smang\OneDrive - bates.edu\Spatial Stats with GIS using R\Spatial Stats with GIS using R\NAT.SHP' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3085 features and 69 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7314 ymin: 24.95597 xmax: -66.96985 ymax: 49.37173
## Geodetic CRS:  WGS 84

We’re dealing with area data here, specifically a polygon shape file, with 69 variables and 3,085 observations! Let’s explore some of its characteristics using the skim() function, which gives a summary of each variable in the data set and their distributions. The main variable I’ll be using for my analysis is hr, which is the homicide rate per 100,000. According to Baller et al. (2001):

“To avoid extreme heterogeneity, the rates are smoothed by taking a three-year average of the county homicide count centered on each decennial census year of the 1960-1990 period (e.g., 1959-1961). These averages are divided by the single-year census population figure (e.g., 1960).”

Therefore, although it’s not the most “accurate” picture, their calculated homicide rates are a pretty good average that’s already been standardized.

 options(scipen = 10)
 
 #only looking at the distribution of Southern counties and homicides rates from 1960-1990
 Homicides %>% 
   skimr::skim() %>% 
   filter(skim_variable %in% c("SOUTH", "HR60", "HR70", "HR80", "HR90"))

Data summary
Name	Piped data
Number of rows	3085
Number of columns	70
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
SOUTH	1	0.46	0.50	0.00	0.00	1.00	1.00	▇▁▁▁▇
HR60	1	4.50	5.65	0.00	2.78	6.89	92.94	▇▁▁▁▁
HR70	1	6.38	7.35	0.80	4.05	9.63	71.84	▇▁▁▁▁
HR80	1	6.93	6.83	1.93	5.23	10.23	59.13	▇▂▁▁▁
HR90	1	6.18	6.64	1.33	4.38	8.94	71.38	▇▁▁▁▁

I decided to look at the distribution of southern counties because Baller et al (2001) emphasized that homicides rates were clustered in the South. According to the code book for this data set, 17 states were scored 1 and considered the south: Washington D.C., Texas, Oklahoma, Arkansas, Louisiana, Mississippi, Alabama, Tennessee, Kentucky, Georgia, South Carolina, North Carolina, Florida, Virginia, West Virginia, Maryland, and Delaware. There are 49 states included in the data set, however, these 17 states consist of almost half of the data set, as you can see below. Additionally, according to the US 2020 census (therefore this isn’t exact and may be an overestimate because our data set is from 1960-1990), these states only make up about 38% of the US population. Keep in mind that three counties were dropped because they are islands (in addition to the states of Alaska and Hawaii, which make up 0.65% of the US population). Nonetheless, in the code book, they state that

“Due to changing county boundaries between 1959 and 1991, certain counties were merged in accordance with the Horan and Hargis county template (Horan and Hargis, 1995). Others were merged because certain data sources combined counties in a particular way.”

Out of the 20 “Fipsno” (which is calculated as: [State fips * 1000] + County fips) codes that includes aggregated counties, 12 of them are in the South (11 in Virginia and 1 in Georgia), according to this data set. Therefore, this can be an explanation for the distribution of Southern counties in the data set.

hist(Homicides$SOUTH)

Additionally, based on the mean, the percentiles, and the histogram (skewed to the right) produced by the skim() function, it looks like we’re dealing with very small homicide rates here with a few outliers from the 1960s to the 90s. This will affect how our maps look, so below I’ll be using a log transformation to see if that gives us a more even distribution of the data. Instead of using the log() on the whole variable, I used it on homicide rates which weren’t 0, because log(0) is - Inf, and I know these values will cause trouble when I’m mapping.

Homicide Rates Per 100,000 Distribution

Click on each tab to see the differences between the distribution of logged homicide rates from 1960-1990.

1960

Homicides <- Homicides %>% 
  mutate(logHR60 = case_when(
    HR60 != 0 ~ log(HR60),
    TRUE ~ 0
  ), logHR70 = case_when(
    HR70 != 0 ~ log(HR70),
    TRUE ~ 0
  ), logHR80 = case_when(
    HR80 != 0 ~ log(HR80),
    TRUE ~ 0
  ), logHR90 = case_when(
    HR90 != 0 ~ log(HR90),
    TRUE ~ 0
  ))

hist(Homicides$logHR60)

1970

hist(Homicides$logHR70)

1980

hist(Homicides$logHR80)

1990

hist(Homicides$logHR90)

Relative to normal distributions, these graphs look unusual and have a lot of 0 values, because I haven’t logged the zero values. The distribution is also more to the right than normal for years 1970-90, however these histograms look way better than before. I’ll be implementing this in my analysis and comparing to see if it improves the choropleth maps as well.

Analysis and Results

Choropleth Maps

First, I’ll create a choropleth map of the homicide rates in the 1960s, without a log transformation, using leaflet() and compare it to a map of logged homicide rates. It’s not possible to zoom in using geom_sf, whereas for leaflet(), you can because its interactive. You can click on a county on the map to get the state its in, the FIPSNO, whether its classified as the South, and the homicide rate per 100,000.

#creating palette for leaflet map
pal60 <- colorNumeric("YlOrRd", Homicides$HR60, reverse = TRUE)
  
#creating the map now, similar to tmap()
Homicides %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(color = "black",
              fillColor = ~pal60(HR60),
              fillOpacity = 0.6,
              weight = 2,
              highlight = highlightOptions(
                weight = 3,
                fillOpacity = 0.9,
                bringToFront = T),
              popup = paste0("State: ", Homicides$STATE_NAME, "<br>",
                             "FIPSNO: ", Homicides$FIPSNO, "<br>", 
                             "South (1 for Yes and 0 for No)? ", Homicides$SOUTH, "<br>",
                             "Homicide Rate per 100,000: ", Homicides$HR60)) %>% 
  addLegend(pal = pal60,
            values = ~HR60,
            opacity = 0.7,
            title = "1960 Homicide Rate per 100,000 in the US",
            position = "topleft")

Keep in mind for this choropleth map and the ones below that brighter areas on the map (very light orange, yellow) means on average, that county has a higher homicide rate per 100,000 during that time period. When I use just the raw numbers, the homicide rates looks very low across the country, with only two counties in Texas having relatively high homicide rates per 100,000. Lets produce this same leaflet() map but with a log transformation and compare it to the other years. Click on each tab to see the difference between the maps of logged Homicide rates from 1960-1990. Keep in mind that although I mapped the logged homicide rates, when you click on a county, the actual homicide rate per 100,000 will appear to avoid confusion.

1960

# logHR60 <- log(Homicides$HR60)
# Homicides <- cbind(Homicides, logHR60)

#creating palette for leaflet map
pallog60 <- colorNumeric("YlOrRd", domain = Homicides$logHR60, reverse = TRUE)
  
#creating the map now, similar to tmap()
Homicides %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(color = "black",
              fillColor = ~pallog60(logHR60),
              fillOpacity = 0.6,
              weight = 2,
              highlight = highlightOptions(
                weight = 3,
                fillOpacity = 0.9,
                bringToFront = T),
              popup = paste0("State: ", Homicides$STATE_NAME, "<br>",
                             "FIPSNO: ", Homicides$FIPSNO, "<br>", 
                             "South (1 for Yes and 0 for No)? ", Homicides$SOUTH, "<br>",
                             "Homicide Rate per 100,000: ", Homicides$HR60)) %>% 
  addLegend(pal = pallog60,
            values = ~logHR60,
            opacity = 0.7,
            title = "1960 Homicide Rate per 100,000 (Logged) in the US",
            position = "topleft")

1970

#creating palette for leaflet map
pallog70 <- colorNumeric("YlOrRd", domain = Homicides$logHR70, reverse = TRUE)
  
#creating the map now, similar to tmap()
Homicides %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(color = "black",
              fillColor = ~pallog70(logHR70),
              fillOpacity = 0.6,
              weight = 2,
              highlight = highlightOptions(
                weight = 3,
                fillOpacity = 0.9,
                bringToFront = T),
              popup = paste0("State: ", Homicides$STATE_NAME, "<br>",
                             "FIPSNO: ", Homicides$FIPSNO, "<br>", 
                             "South (1 for Yes and 0 for No)? ", Homicides$SOUTH, "<br>",
                             "Homicide Rate per 100,000: ", Homicides$HR70)) %>% 
  addLegend(pal = pallog70,
            values = ~logHR60,
            opacity = 0.7,
            title = "1970 Homicide Rate per 100,000 (Logged) in the US",
            position = "topleft")

For some reason, some values fell outside of the scale here, so they were treated as NA.

1980

#creating palette for leaflet map
pallog80 <- colorNumeric("YlOrRd", domain = Homicides$logHR80, reverse = TRUE)
  
#creating the map now, similar to tmap()
Homicides %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(color = "black",
              fillColor = ~pallog80(logHR80),
              fillOpacity = 0.6,
              weight = 2,
              highlight = highlightOptions(
                weight = 3,
                fillOpacity = 0.9,
                bringToFront = T),
              popup = paste0("State: ", Homicides$STATE_NAME, "<br>",
                             "FIPSNO: ", Homicides$FIPSNO, "<br>", 
                             "South (1 for Yes and 0 for No)? ", Homicides$SOUTH, "<br>",
                             "Homicide Rate per 100,000: ", Homicides$HR80)) %>% 
  addLegend(pal = pallog80,
            values = ~logHR80,
            opacity = 0.7,
            title = "1980 Homicide Rate per 100,000 (Logged) in the US",
            position = "topleft")

1990

#creating palette for leaflet map
pallog90 <- colorNumeric("YlOrRd", domain = Homicides$logHR90, reverse = TRUE)
  
#creating the map now, similar to tmap()
Homicides %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(color = "black",
              fillColor = ~pallog90(logHR90),
              fillOpacity = 0.6,
              weight = 2,
              highlight = highlightOptions(
                weight = 3,
                fillOpacity = 0.9,
                bringToFront = T),
              popup = paste0("State: ", Homicides$STATE_NAME, "<br>",
                             "FIPSNO: ", Homicides$FIPSNO, "<br>", 
                             "South (1 for Yes and 0 for No)? ", Homicides$SOUTH, "<br>",
                             "Homicide Rate per 100,000: ", Homicides$HR90)) %>% 
  addLegend(pal = pallog90,
            values = ~logHR90,
            opacity = 0.7,
            title = "1990 Homicide Rate per 100,000 (Logged) in the US",
            position = "topleft")

It’s much easier to tell the difference between counties in all four of the logged maps (although that also may make it harder to interpret). They also look noticeable brighter in the South, meaning that homicide rates per 100,000 are higher there relative to the rest of the country.

Spatial Autocorrelation

To confirm the comments I made above about the South, I’m going to measure if there’s any spatial autocorrelation in these maps (specifically in the Southeast region), how unusual they are, and if not, how they’ve changed over time. First, I’m going to specify what spatial structure I’m going to use. I’m going to use the contiguity based neighbors approach.

#create a neighbors list using the poly2nb method
Homicides_nb <- poly2nb(Homicides)

#a matrix with 3,085 "zones" a.k.a. counties
Homicides_coords <- st_coordinates(st_centroid(st_geometry(Homicides)))

As reiterated above, Alaska and Hawaii were excluded from the data set, in addition to 3 other counties because they were islands. As a result, there will be no areas that are disjoint and without links, as we can see from the numerical summary of our neighborhood object and the map of it below.

#numerical summary of our neighborhood object
summary(Homicides_nb)

## Neighbour list object:
## Number of regions: 3085 
## Number of nonzero links: 18168 
## Percentage nonzero weights: 0.190896 
## Average number of links: 5.889141 
## Link number distribution:
## 
##    1    2    3    4    5    6    7    8    9   10   11   13   14 
##   24   36   91  281  620 1037  704  227   50   11    2    1    1 
## 24 least connected regions:
## 45 49 585 643 837 1197 1377 1402 1442 1464 1470 1523 1531 1532 1543 1596 1605 1653 1737 1767 1775 2892 2893 2919 with 1 link
## 1 most connected region:
## 1371 with 14 links

#visualize what it looks like
plot(st_geometry(Homicides), border = "grey")
plot(Homicides_nb, Homicides_coords, add = TRUE, col = "red")

Now I’m going to add weights into our W matrix.

contig_listw <- nb2listw(Homicides_nb, style = "B", zero.policy = TRUE)

Moran’s I

Now we’re going to calculate the Moran’s I, using a simple contiguity definition of neighborhood, and whether they’re statistically significant for all 4 years. As before, you can press the tabs below to observe the observed values and other statistics for each year.

1960

moran.test(Homicides$logHR60, listw = contig_listw, zero.policy = TRUE, randomisation = FALSE)

## 
##  Moran I test under normality
## 
## data:  Homicides$logHR60  
## weights: contig_listw    
## 
## Moran I statistic standard deviate = 45.833, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4800510407     -0.0003242542      0.0001098495

set.seed = (4560)
moran.mc(Homicides$logHR60, listw = contig_listw, zero.policy = TRUE, nsim = 99)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Homicides$logHR60 
## weights: contig_listw  
## number of simulations + 1: 100 
## 
## statistic = 0.48005, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater

1970

moran.test(Homicides$logHR70, listw = contig_listw, zero.policy = TRUE, randomisation = FALSE)

## 
##  Moran I test under normality
## 
## data:  Homicides$logHR70  
## weights: contig_listw    
## 
## Moran I statistic standard deviate = 49.515, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5186385973     -0.0003242542      0.0001098495

set.seed = (4564)
moran.mc(Homicides$logHR70, listw = contig_listw, zero.policy = TRUE, nsim = 99)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Homicides$logHR70 
## weights: contig_listw  
## number of simulations + 1: 100 
## 
## statistic = 0.51864, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater

1980

moran.test(Homicides$logHR80, listw = contig_listw, zero.policy = TRUE, randomisation = FALSE)

## 
##  Moran I test under normality
## 
## data:  Homicides$logHR80  
## weights: contig_listw    
## 
## Moran I statistic standard deviate = 45.146, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4728457784     -0.0003242542      0.0001098495

set.seed = (4568)
moran.mc(Homicides$logHR80, listw = contig_listw, zero.policy = TRUE, nsim = 99)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Homicides$logHR80 
## weights: contig_listw  
## number of simulations + 1: 100 
## 
## statistic = 0.47285, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater

1990

moran.test(Homicides$logHR90, listw = contig_listw, zero.policy = TRUE, randomisation = FALSE)

## 
##  Moran I test under normality
## 
## data:  Homicides$logHR90  
## weights: contig_listw    
## 
## Moran I statistic standard deviate = 43.783, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4585584835     -0.0003242542      0.0001098495

set.seed = (4576)
moran.mc(Homicides$logHR90, listw = contig_listw, zero.policy = TRUE, nsim = 99)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Homicides$logHR90 
## weights: contig_listw  
## number of simulations + 1: 100 
## 
## statistic = 0.45856, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater

For all 4 years, the observed I value is very unusual (relative to the expected value of -0.0003) and the z-scores are very very high. As shown by the p-vales (and the Monte Carlo randomization) it is highly unlikely that there’s no spatial autocorrelation and that these processes are random. The spatial distribution of high or low values in the data set is more spatially clustered than would be expected if this process were random. Thus, from 1960-990, we can confirm that there’s statistically significant positive global spatial autocorrelation, which shouldn’t be a surprise. I don’t know if it’s by chance, but the observed I value is higher in 1970 compared to the observed I values for the other years (which are centered around 0.46 - 0.48).

Local Spatial Autocorrelation

We can now use local statistics to help us identify where the clustered values are on our map for each year.

Moran Scatterplot (MSP)

I’m going to use standardized data when forming a MSP, by transforming the logged variables into z-scores. I’ve decided to use the logged variables instead of the variables in the data set, because it has a more normal distribution.

# LOGGED

#1960
mean_loghr60 <- mean(Homicides$logHR60)
sd_loghr60 <- sd(Homicides$logHR60)
z_loghr60 <- ((Homicides$logHR60) - mean_loghr60)/sd_loghr60

#1990
mean_loghr90 <- mean(Homicides$logHR90)
sd_loghr90 <- sd(Homicides$logHR90)
z_loghr90 <- ((Homicides$logHR90) - mean_loghr90)/sd_loghr90

#1970
mean_loghr70 <- mean(Homicides$logHR70)
sd_loghr70 <- sd(Homicides$logHR70)
z_loghr70 <- ((Homicides$logHR70) - mean_loghr70)/sd_loghr70

#1980
mean_loghr80 <- mean(Homicides$logHR80)
sd_loghr80 <- sd(Homicides$logHR80)
z_loghr80 <- ((Homicides$logHR80) - mean_loghr80)/sd_loghr80

Now I’m going to plot the Moran Scatter plot for the contiguity neighbors with these normalized values using moran.plot().

1960

msp60 <- moran.plot(z_loghr60, listw = contig_listw, zero.policy = TRUE)

In 1960, the values are mostly scattered around the upper right quadrant. When I looked at median homicide rates of the leverages, they were clustered high values in Southern counties. On the other hand, low values of the leverages were clustered around other low values in the Northeast and Midwest. However, as illustrated in the histograms of the data above, the values and zone identifiers in lower right quadrant indicate that there are high outliers clustered around low values.

1970

msp70 <- moran.plot(z_loghr70, listw = contig_listw, zero.policy = TRUE)

Most of the values are scattered around the upper right quadrant in 1970 as well. Like 1960, the leverage values are clustered in the South, the low leverage values are more present in the Midwest. The outlier leverages here are in different states, but in similar regions (the Midwest).

1980

msp80 <- moran.plot(z_loghr80, listw = contig_listw, zero.policy = TRUE)

The values in 1980 are still mostly scattered in the upper right quadrant however, there seems to more leverages and values overlapping in the lower right quadrant as well. The leverages here are also in the southern region, with 3 values in the Midwest. Although I’d consider the median homicide rates for Kansas, Nebraska, and South Dakota to be outliers in this data set, most of the low leverages are still present in the Midwest (excluding two values in the Northeast). Most of the high leverages here are present around other low values in the Midwest as well (Although I’d consider Idaho and California to be in the west).

1990

msp90 <- moran.plot(z_loghr90, listw = contig_listw, zero.policy = TRUE)

Consistent with the 1980’s, most of the values and leverages in the 90s are scattered in the upper and lower right quadrants and the leverages in the South. Again, I’d consider the median homicide rate for Minnesota to be outlier. Regardless, most of the low leverages here are present in the Midwest (excluding two values in the Northeast). The high leverages here are present around other low values in the Midwest.

Keep in mind that the data frames I looked at only represented the leverages that I was able to visibly see on the scatter plot, and thus, don’t necessarily represent where homicide rates are clustered locally. Additionally, for some reason, there seems to be a straight line of values just beneath the logged z score -1, but it may have to do with my distribution of 0 values in my data. This, in addition to the 3,000+ data points we have, also make these scatter plots hard to interpret.

LISA Map

It is no doubt that the maps above are very hard to interpret because there’s so many counties. Although I can access the values in the scatter plots above and map them, I decided to map the which quadrant each county was in, if that zones p-value was significant. This was done in the paper as well, and is a very helpful way of seeing where significant clusters are located on the map.

1960

local_contig60 <- localmoran(z_loghr60, listw = contig_listw, zero.policy = TRUE)
local_I60 <- local_contig60[,1]
Homicides <- cbind(Homicides, local_I60)


#create a new variable identifying the moran plot quadrant for each obs
msp60$quad_sig60 <- NA

# high-high quadrant
msp60[(msp60$x >= 0 & msp60$wx >= 0) & (local_contig60[, 5] <= 0.05), "quad_sig60"] <- "high-high"

# low-low quadrant
msp60[(msp60$x <= 0 & msp60$wx <= 0) & (local_contig60[, 5] <= 0.05), "quad_sig60"] <- "low-low"

# high-low quadrant
msp60[(msp60$x >= 0 & msp60$wx <= 0) & (local_contig60[, 5] <= 0.05), "quad_sig60"] <- "high-low"

# low-high quadrant
msp60[(msp60$x <= 0 & msp60$wx >= 0) & (local_contig60[, 5] <= 0.05), "quad_sig60"] <- "low-high"

# non-significant observations
msp60[(local_contig60[, 5] > 0.05), "quad_sig60"] <- "not signif."  


#Combing data frames and plotting the map
Homicides_lisamap <- cbind(Homicides[,1:13], msp60$quad_sig60) 

ggplot(Homicides_lisamap) +
  geom_sf(aes(fill = msp60.quad_sig60)) + 
   labs(title = "1960 Moran Scatterplot Map of Homicide Rates in the US", 
       fill = "MSP Quadrant")

1970

local_contig70 <- localmoran(z_loghr70, listw = contig_listw, zero.policy = TRUE)
local_I70 <- local_contig70[,1]
Homicides <- cbind(Homicides, local_I70)


#create a new variable identifying the moran plot quadrant for each obs
msp70$quad_sig70 <- NA

# high-high quadrant
msp70[(msp70$x >= 0 & msp70$wx >= 0) & (local_contig70[, 5] <= 0.05), "quad_sig70"] <- "high-high"

# low-low quadrant
msp70[(msp70$x <= 0 & msp70$wx <= 0) & (local_contig70[, 5] <= 0.05), "quad_sig70"] <- "low-low"

# high-low quadrant
msp70[(msp70$x >= 0 & msp70$wx <= 0) & (local_contig70[, 5] <= 0.05), "quad_sig70"] <- "high-low"

# low-high quadrant
msp70[(msp70$x <= 0 & msp70$wx >= 0) & (local_contig70[, 5] <= 0.05), "quad_sig70"] <- "low-high"

# non-significant observations
msp70[(local_contig70[, 5] > 0.05), "quad_sig70"] <- "not signif."  


#Combing data frames and plotting the map
Homicides_lisamap <- cbind(Homicides_lisamap, msp70$quad_sig70) 

ggplot(Homicides_lisamap) +
  geom_sf(aes(fill = msp70.quad_sig70)) + 
  labs(title = "1970 Moran Scatterplot Map of Homicide Rates in the US", 
       fill = "MSP Quadrant")

1980

local_contig80 <- localmoran(z_loghr80, listw = contig_listw, zero.policy = TRUE)
local_I80 <- local_contig80[,1]
Homicides <- cbind(Homicides, local_I80)


#create a new variable identifying the moran plot quadrant for each obs
msp80$quad_sig80 <- NA

# high-high quadrant
msp80[(msp80$x >= 0 & msp80$wx >= 0) & (local_contig80[, 5] <= 0.05), "quad_sig80"] <- "high-high"

# low-low quadrant
msp80[(msp80$x <= 0 & msp80$wx <= 0) & (local_contig80[, 5] <= 0.05), "quad_sig80"] <- "low-low"

# high-low quadrant
msp80[(msp80$x >= 0 & msp80$wx <= 0) & (local_contig80[, 5] <= 0.05), "quad_sig80"] <- "high-low"

# low-high quadrant
msp80[(msp80$x <= 0 & msp80$wx >= 0) & (local_contig80[, 5] <= 0.05), "quad_sig80"] <- "low-high"

# non-significant observations
msp80[(local_contig80[, 5] > 0.05), "quad_sig80"] <- "not signif."  


#Combing data frames and plotting the map
Homicides_lisamap <- cbind(Homicides_lisamap, msp80$quad_sig80) 

ggplot(Homicides_lisamap) +
  geom_sf(aes(fill = msp80.quad_sig80)) + 
  labs(title = "1980 Moran Scatterplot Map of Homicide Rates in the US", 
       fill = "MSP Quadrant")

1990

local_contig90 <- localmoran(z_loghr90, listw = contig_listw, zero.policy = TRUE)
local_I90 <- local_contig90[,1]
Homicides <- cbind(Homicides, local_I90)


#create a new variable identifying the moran plot quadrant for each obs
msp90$quad_sig90 <- NA

# high-high quadrant
msp90[(msp90$x >= 0 & msp90$wx >= 0) & (local_contig90[, 5] <= 0.05), "quad_sig90"] <- "high-high"

# low-low quadrant
msp90[(msp90$x <= 0 & msp90$wx <= 0) & (local_contig90[, 5] <= 0.05), "quad_sig90"] <- "low-low"

# high-low quadrant
msp90[(msp90$x >= 0 & msp90$wx <= 0) & (local_contig90[, 5] <= 0.05), "quad_sig90"] <- "high-low"

# low-high quadrant
msp90[(msp90$x <= 0 & msp90$wx >= 0) & (local_contig90[, 5] <= 0.05), "quad_sig90"] <- "low-high"

# non-significant observations
msp90[(local_contig90[, 5] > 0.05), "quad_sig90"] <- "not signif."  


#Combing data frames and plotting the map
Homicides_lisamap <- cbind(Homicides_lisamap, msp90$quad_sig90) 

ggplot(Homicides_lisamap) +
  geom_sf(aes(fill = msp90.quad_sig90)) + 
  labs(title = "1990 Moran Scatterplot Map of Homicide Rates in the US", 
       fill = "MSP Quadrant")

Based on all four maps above, high homicide rates are clustered in these states: Georgia, Virginia, North & South Carolina, Florida, Alabama, Kentucky, Arkansas, Tennessee, Mississippi, and Texas. This is also where the leverages were. Low homicide rates are clustered in Idaho, Illinois, Indiana, Iowa, Kansas, Minnesota, Nebraska, North & South Dakota, Wisconsin, New York and Pennsylvania (places where the low leverages were as well). Our Moran Scatter plots also have several values and leverages in the lower right quadrant, indicating negative autocorrelation. Globally, there’s still a positive regression line and Moran I value from the 1960s to the 90s, however, this means there were a number of high homicide rates (outliers) around low homicide rates. This usually took place in Kansas, Missouri, Indiana, and several other states in the Midwest, as the leverages showed above as well.

Conclusion

To reiterate what was stated above, most of the values from our Moran Scatter plots were in the upper right and lower left quadrants, indicating that there’s positive autocorrelation (as our observed Moran I values have confirmed) and that the high homicide rates were clustered around other high homicide rates (vice versa for low homicide rates). The map has proved to be useful as well by eliminating a lot of zones where the p-value isn’t significant. In the choropleth maps above, and, even the LISA maps (not pictured), showed a lot of variation which made them harder to analyze. When we look at areas with significant p-values, the maps and scatter plots above are consistent with these results (of positive spatial autocorrelation).

The final question remains: have spatial patterns in US homicide rates changed significantly? The answer is no. Generally, high homicide rates are concentrated in the southeast region of the US whereas low homicide rates are concentrated mostly in New York, Pennsylvania and the Midwest. Although some counties have fluctuated in terms of the quadrants they fall in (for example, the amount of counties in Virginia that were considered “high-high” went from 34 in 1960 to 14 in 1970), the spatial distribution of low and high homicide rates have stayed consistent. High homicide rates (outliers) among low ones is also pretty scattered through the Midwest region, with some fluctuations in the south and the Northeast. Regardless, I can conclude that the spatial patterns of homicide rates in the US haven’t changed significantly.

Why were there fluctuations in the homicide rates? What caused consistency across homicide rates from the 60s to the 90s? That would require further, more detailed, analysis. What I’ve done here was explore the presence of a spatial pattern among homicide rates and whether they’ve changed. However, Baller et al. (2001) has explored the questions I’ve mentioned above and if there was a relationship between covariates (all the other variables in the data set). They also provide future directions as well.

Limitations

Although a logged transformation of the homicide rates gave us a more normal distribution, it affected the distribution of zero values (the way I did it at least because I didn’t know how to get rid of -inf values) and I believe it affected the points on the scatter plot as well. Fixing this may alter the slope of our regressions lines. Additionally, it maybe would’ve been easier to analyze the maps if states instead of counties were used as a unit of analysis. As we can see in the choropleth maps above, there was a lot of heterogeneity in the map. However, using counties helped us see the high outliers present in the data, as those rates weren’t concentrated in one state [see Baller et al., 2001, p.569). Last but not least, as we can see on all four scatter plots, there are a lot of values that overlap, and are on the x axis, thus, some of counties may fit in “both” quadrants on the scatter plot, but not on the map (meaning it’s not a 100% accurate)

Takeaways

I’d like to end this project off with a thank you for the class. It was definitely difficult, because I was also doing research on the side. However, I truly enjoyed the work I was doing and am definitely going to look more into the field of spatial statistics, especially using R. Thank you for for your lessons, all your hard work, and not shying away from the limitations of all of the methods we’ve used (necessary!). Have a good one and again, thank you!

Word Count without code: ~2,500

References:

Baller et al. (2001). Structural Covariates of U.S. County Homicide Rates. Criminology. 39. 561 - 588. 10.1111/j.1745-9125.2001.tb00933.x.

Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46, 234–240. https://doi.org/10.2307/143141

Spatial Statistics with GIS using R: Final Project

Sokona Mangane

2022 August 15

Introduction and Data

Homicide Rates Per 100,000 Distribution

1960

1970

1980

1990

Analysis and Results

Choropleth Maps

1960

1970

1980

1990

Spatial Autocorrelation

Moran’s I

1960

1970

1980

1990

Local Spatial Autocorrelation

Moran Scatterplot (MSP)

1960

1970

1980

1990

LISA Map

1960

1970

1980

1990

Conclusion

Limitations

Takeaways