1 Summary

This lab report documents a spatial autocorrelation analysis conducted on the Age-Adjusted Colon Cancer Late-Stage Incidence Ratio (AgeAdjstRatio) for a 10-county Central Texas study area. The analysis addresses two core tasks: (1) construction and display of a spatial weights matrix \(W_{ij}\) based on queen contiguity adjacency, and (2) computation and significance testing of Moran’s I statistic.

The 10-county study area encompasses the Austin-San Marcos metropolitan corridor and surrounding counties: Travis, Hays, Williamson, Bastrop, Caldwell, Comal, Guadalupe, Bexar, Blanco, and Burnet. This cluster represents a contiguous geographic unit appropriate for spatial dependence analysis. Results indicate the presence (or absence) of spatial clustering in late-stage colon cancer incidence across this region, with implications for health geography research and policy targeting.

2 Description of Data Used

2.1 Spatial Data

The spatial data consist of a polygon shapefile (County.shp) representing all 254 Texas counties. Key attribute fields used in this analysis include:

Field	Description
`CNTY_NM`	County name
`CNTY_FIPS`	County FIPS code (character)
`geometry`	Polygon geometry (EPSG:4326)

The shapefile is sourced from a Texas county boundary dataset and was reprojected to EPSG:3082 (Texas Centric Albers Equal Area) for accurate adjacency calculation.

2.2 Tabular Data

The tabular dataset (GEO7301_Labs3to5_CountyVariablesColonLateStage2005to2014.xlsx) contains 245 Texas counties with 29 variables including:

Variable	Description
`NAME`	County name
`FIPS`	5-digit FIPS code (integer)
`AgeAdjstRatio`	Age-adjusted late-stage colon cancer incidence ratio (per 100,000)
`PopAtRisk`	Population at risk
`POPULATION`	Total county population
Various socioeconomic	Race/ethnicity, housing, family size, etc.

The variable of primary interest is AgeAdjstRatio, representing age-adjusted late-stage colon cancer incidence for the period 2005-2014. This variable is appropriate for spatial autocorrelation analysis given its health-geographic interpretive value and potential for spatial clustering driven by shared environmental, demographic, or healthcare access factors.

3 Methods and Analysis Procedures

3.1 R Packages

# Install packages if not already installed
pkgs <- c("sf", "spdep", "readxl", "dplyr", "ggplot2",
          "knitr", "kableExtra", "RColorBrewer", "viridis",
          "gridExtra", "scales")

for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}

library(sf)
library(spdep)
library(readxl)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(RColorBrewer)
library(viridis)
library(gridExtra)
library(scales)

cat("All packages loaded successfully.\n")

## All packages loaded successfully.

cat("R version:", as.character(getRversion()), "\n")

## R version: 4.3.3

cat("sf version:", as.character(packageVersion("sf")), "\n")

## sf version: 1.0.16

cat("spdep version:", as.character(packageVersion("spdep")), "\n")

## spdep version: 1.3.10

3.2 Data Loading

# ---- Paths ----------------------------------------------------------------
shp_path  <- file.path(out_dir, "County.shp")
xlsx_path <- file.path(out_dir,
  "GEO7301_Labs3to5_CountyVariablesColonLateStage2005to2014.xlsx")

# Copy source files to output directory if not already there
if (!file.exists(shp_path)) {
  # Adjust these source paths to match your local setup
  src_shp   <- list.files(".", pattern = "County\\.shp$", recursive = TRUE,
                           full.names = TRUE)[1]
  src_xlsx  <- list.files(".", pattern = "\\.xlsx$", recursive = TRUE,
                           full.names = TRUE)[1]
  shp_files <- list.files(dirname(src_shp),
                           pattern = "County\\.(shp|dbf|prj|shx|cpg)$",
                           full.names = TRUE)
  file.copy(shp_files, out_dir, overwrite = TRUE)
  file.copy(src_xlsx, out_dir, overwrite = TRUE)
}

# ---- Load shapefile -------------------------------------------------------
tx_counties <- st_read(shp_path, quiet = TRUE)
cat("Shapefile CRS:", st_crs(tx_counties)$epsg, "\n")

## Shapefile CRS: 4326

cat("Number of counties in shapefile:", nrow(tx_counties), "\n")

## Number of counties in shapefile: 254

# ---- Load Excel table -----------------------------------------------------
county_data <- read_excel(xlsx_path)
cat("Rows in Excel table:", nrow(county_data), "\n")

## Rows in Excel table: 245

cat("Variables:", ncol(county_data), "\n")

## Variables: 29

3.3 Defining the 10-County Study Area

The study area is defined as 10 contiguous Central Texas counties forming the Austin-San Marcos corridor and its periphery. This selection reflects geographic contiguity and regional relevance for public health spatial analysis.

# ---- Define the 10-county study area ------------------------------------
study_counties <- c(
  "Travis",     # Austin core
  "Hays",       # San Marcos / Wimberley
  "Williamson", # Round Rock / Georgetown
  "Bastrop",    # Bastrop
  "Caldwell",   # Lockhart
  "Comal",      # New Braunfels
  "Guadalupe",  # Seguin
  "Bexar",      # San Antonio
  "Blanco",     # Johnson City
  "Burnet"      # Burnet
)

cat("Study counties:\n")

## Study counties:

cat(paste(" -", study_counties, collapse = "\n"), "\n")

##  - Travis
##  - Hays
##  - Williamson
##  - Bastrop
##  - Caldwell
##  - Comal
##  - Guadalupe
##  - Bexar
##  - Blanco
##  - Burnet

3.4 Data Preparation and Joining

# ---- Reproject to Texas Centric Albers Equal Area (EPSG:3082) -----------
tx_counties_proj <- st_transform(tx_counties, crs = 3082)

# ---- Subset to study area ------------------------------------------------
study_shp <- tx_counties_proj %>%
  filter(CNTY_NM %in% study_counties)

cat("Counties retained in spatial subset:", nrow(study_shp), "\n")

## Counties retained in spatial subset: 10

# ---- Prepare FIPS key for join -------------------------------------------
# Shapefile FIPS is character (e.g., "48453"); Excel FIPS is integer (48453)
study_shp <- study_shp %>%
  mutate(FIPS_key = as.integer(CNTY_FIPS))

# ---- Join tabular data to spatial layer ----------------------------------
study_data <- county_data %>%
  filter(NAME %in% study_counties) %>%
  select(NAME, FIPS, AgeAdjstRatio, POPULATION, PopAtRisk,
         POP_SQMI, BLACK, HISPANIC, RENTER_OCC, VACANT)

study_sf <- study_shp %>%
  left_join(study_data, by = c("FIPS_key" = "FIPS"))

cat("\nJoined dataset dimensions:", nrow(study_sf), "rows x",
    ncol(study_sf), "cols\n")

## 
## Joined dataset dimensions: 10 rows x 26 cols

cat("NA in AgeAdjstRatio:", sum(is.na(study_sf$AgeAdjstRatio)), "\n")

## NA in AgeAdjstRatio: 0

3.5 Summary Statistics for Study Area

summary_df <- study_sf %>%
  st_drop_geometry() %>%
  select(NAME, AgeAdjstRatio, POPULATION, PopAtRisk, POP_SQMI) %>%
  arrange(desc(AgeAdjstRatio))

summary_df %>%
  kbl(
    caption = "Table 1: Descriptive Summary of Study Area Counties",
    digits  = 2,
    col.names = c("County", "Age-Adj Ratio", "Population", "Pop at Risk",
                  "Pop/sq mi"),
    align   = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 12
  ) %>%
  column_spec(2, bold = TRUE, color = "white",
              background = spec_color(summary_df$AgeAdjstRatio,
                                      palette = "YlOrRd",
                                      end = 0.85)) %>%
  add_header_above(c(" " = 1, "Health Variable" = 1, "Demographic Context" = 3)) %>%
  footnote(general = "AgeAdjstRatio = Age-adjusted late-stage colon cancer incidence ratio per 100,000 population, 2005-2014.")

Table 1: Descriptive Summary of Study Area Counties
	Health Variable	Demographic Context
County	Age-Adj Ratio	Population	Pop at Risk	Pop/sq mi
Caldwell	23.5	41606	379646	76.1
Guadalupe	22.3	152994	1285404	214.0
Bexar	21.1	1897710	17008499	1511.0
Bastrop	20.4	82757	733819	92.4
Hays	19.8	199764	1550533	294.0
Burnet	19.8	47418	426042	46.4
Comal	19.1	131083	1074227	227.9
Travis	18.3	1166555	10221115	1140.0
Williamson	18.2	506367	4150032	446.3
Blanco	17.6	11228	102567	15.7
Note:
AgeAdjstRatio = Age-adjusted late-stage colon cancer incidence ratio per 100,000 population, 2005-2014.

cat("=== Descriptive Statistics: AgeAdjstRatio ===\n")

## === Descriptive Statistics: AgeAdjstRatio ===

x <- study_sf$AgeAdjstRatio
stats_df <- data.frame(
  Statistic = c("N", "Min", "Max", "Mean", "Median", "SD", "CV (%)"),
  Value     = c(
    length(x),
    round(min(x), 2),
    round(max(x), 2),
    round(mean(x), 2),
    round(median(x), 2),
    round(sd(x), 3),
    round(sd(x)/mean(x)*100, 2)
  )
)

stats_df %>%
  kbl(caption = "Table 2: Descriptive Statistics for AgeAdjstRatio",
      align = c("l","r")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, position = "center")

Table 2: Descriptive Statistics for AgeAdjstRatio
Statistic	Value
N	10.000
Min	17.600
Max	23.500
Mean	20.010
Median	19.800
SD	1.876
CV (%)	9.380

3.6 Step 1: Spatial Weights Matrix Construction

The spatial weights matrix \(W_{ij}\) is constructed using queen contiguity (shared boundary or vertex). For \(n = 10\) counties, the binary adjacency matrix \(W\) is defined as:

\[W_{ij} = \begin{cases} 1 & \text{if counties } i \text{ and } j \text{ share a boundary or vertex} \\ 0 & \text{otherwise} \end{cases}\]

Row standardization transforms \(W\) so that each row sums to 1, yielding the row-stochastic matrix \(W^{(RS)}\), which is the standard form used in Moran’s I computation:

\[w_{ij}^{(RS)} = \frac{W_{ij}}{\sum_j W_{ij}}\]

# ---- Build queen contiguity neighbors list --------------------------------
nb_queen <- poly2nb(study_sf, queen = TRUE)

cat("=== Neighbor Summary (Queen Contiguity) ===\n")

## === Neighbor Summary (Queen Contiguity) ===

print(summary(nb_queen))

## Neighbour list object:
## Number of regions: 10 
## Number of nonzero links: 38 
## Percentage nonzero weights: 38 
## Average number of links: 3.8 
## Link number distribution:
## 
## 2 3 4 5 6 
## 1 3 4 1 1 
## 1 least connected region:
## 1 with 2 links
## 1 most connected region:
## 8 with 6 links

# ---- Assign region names for readability ----------------------------------
attr(nb_queen, "region.id") <- study_sf$CNTY_NM

# ---- Row-standardized spatial weights object ------------------------------
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = FALSE)

cat("\n=== Spatial Weights List Summary ===\n")

## 
## === Spatial Weights List Summary ===

print(summary(lw_queen))

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 10 
## Number of nonzero links: 38 
## Percentage nonzero weights: 38 
## Average number of links: 3.8 
## Link number distribution:
## 
## 2 3 4 5 6 
## 1 3 4 1 1 
## 1 least connected region:
## Bexar with 2 links
## 1 most connected region:
## Travis with 6 links
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0       S1       S2
## W 10 100 10 5.486111 41.00556

3.6.1 Display the Full Binary Adjacency Matrix

# ---- Convert to full binary matrix ----------------------------------------
W_binary <- nb2mat(nb_queen, style = "B", zero.policy = FALSE)
rownames(W_binary) <- study_sf$CNTY_NM
colnames(W_binary) <- study_sf$CNTY_NM

cat("=== Binary Spatial Weights Matrix W_ij (Queen Contiguity) ===\n\n")

## === Binary Spatial Weights Matrix W_ij (Queen Contiguity) ===

W_display <- as.data.frame(W_binary)
print(W_display)

##            Bexar Guadalupe Comal Caldwell Hays Bastrop Blanco Travis Williamson
## Bexar          0         1     1        0    0       0      0      0          0
## Guadalupe      1         0     1        1    1       0      0      0          0
## Comal          1         1     0        0    1       0      1      0          0
## Caldwell       0         1     0        0    1       1      0      1          0
## Hays           0         1     1        1    0       0      1      1          0
## Bastrop        0         0     0        1    0       0      0      1          1
## Blanco         0         0     1        0    1       0      0      1          0
## Travis         0         0     0        1    1       1      1      0          1
## Williamson     0         0     0        0    0       1      0      1          0
## Burnet         0         0     0        0    0       0      1      1          1
##            Burnet
## Bexar           0
## Guadalupe       0
## Comal           0
## Caldwell        0
## Hays            0
## Bastrop         0
## Blanco          1
## Travis          1
## Williamson      1
## Burnet          0

W_binary %>%
  as.data.frame() %>%
  kbl(
    caption = "Table 3: Binary Spatial Weights Matrix W_ij (Queen Contiguity, n = 10 Counties)",
    digits  = 0
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 11
  ) %>%
  column_spec(1:11, width = "1.6cm") %>%
  footnote(general = paste0(
    "W_ij = 1 if counties i and j are queen-contiguous; 0 otherwise. ",
    "Diagonal entries are 0 by convention. ",
    "Column names abbreviated to county names."
  ))

Table 3: Binary Spatial Weights Matrix W_ij (Queen Contiguity, n = 10 Counties)
	Bexar	Guadalupe	Comal	Caldwell	Hays	Bastrop	Blanco	Travis	Williamson	Burnet
Bexar	0	1	1	0	0	0	0	0	0	0
Guadalupe	1	0	1	1	1	0	0	0	0	0
Comal	1	1	0	0	1	0	1	0	0	0
Caldwell	0	1	0	0	1	1	0	1	0	0
Hays	0	1	1	1	0	0	1	1	0	0
Bastrop	0	0	0	1	0	0	0	1	1	0
Blanco	0	0	1	0	1	0	0	1	0	1
Travis	0	0	0	1	1	1	1	0	1	1
Williamson	0	0	0	0	0	1	0	1	0	1
Burnet	0	0	0	0	0	0	1	1	1	0
Note:
W_ij = 1 if counties i and j are queen-contiguous; 0 otherwise. Diagonal entries are 0 by convention. Column names abbreviated to county names.

3.6.2 Display the Row-Standardized Weights Matrix

# ---- Row-standardized matrix ----------------------------------------------
W_rowstd <- nb2mat(nb_queen, style = "W", zero.policy = FALSE)
rownames(W_rowstd) <- study_sf$CNTY_NM
colnames(W_rowstd) <- study_sf$CNTY_NM

cat("=== Row-Standardized Spatial Weights Matrix W_ij^(RS) ===\n\n")

## === Row-Standardized Spatial Weights Matrix W_ij^(RS) ===

print(round(W_rowstd, 4))

##            Bexar Guadalupe Comal Caldwell   Hays Bastrop Blanco Travis
## Bexar       0.00      0.50  0.50   0.0000 0.0000  0.0000 0.0000 0.0000
## Guadalupe   0.25      0.00  0.25   0.2500 0.2500  0.0000 0.0000 0.0000
## Comal       0.25      0.25  0.00   0.0000 0.2500  0.0000 0.2500 0.0000
## Caldwell    0.00      0.25  0.00   0.0000 0.2500  0.2500 0.0000 0.2500
## Hays        0.00      0.20  0.20   0.2000 0.0000  0.0000 0.2000 0.2000
## Bastrop     0.00      0.00  0.00   0.3333 0.0000  0.0000 0.0000 0.3333
## Blanco      0.00      0.00  0.25   0.0000 0.2500  0.0000 0.0000 0.2500
## Travis      0.00      0.00  0.00   0.1667 0.1667  0.1667 0.1667 0.0000
## Williamson  0.00      0.00  0.00   0.0000 0.0000  0.3333 0.0000 0.3333
## Burnet      0.00      0.00  0.00   0.0000 0.0000  0.0000 0.3333 0.3333
##            Williamson Burnet
## Bexar          0.0000 0.0000
## Guadalupe      0.0000 0.0000
## Comal          0.0000 0.0000
## Caldwell       0.0000 0.0000
## Hays           0.0000 0.0000
## Bastrop        0.3333 0.0000
## Blanco         0.0000 0.2500
## Travis         0.1667 0.1667
## Williamson     0.0000 0.3333
## Burnet         0.3333 0.0000
## attr(,"call")
## nb2mat(neighbours = nb_queen, style = "W", zero.policy = FALSE)

round(W_rowstd, 4) %>%
  as.data.frame() %>%
  kbl(
    caption = "Table 4: Row-Standardized Spatial Weights Matrix W_ij (style = 'W')",
    digits  = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 11
  ) %>%
  footnote(general = paste0(
    "Each row sums to 1.0. ",
    "w_ij = W_ij / sum_j(W_ij). ",
    "Used as input for Moran's I computation."
  ))

Table 4: Row-Standardized Spatial Weights Matrix W_ij (style = ‘W’)
	Bexar	Guadalupe	Comal	Caldwell	Hays	Bastrop	Blanco	Travis	Williamson	Burnet
Bexar	0.00	0.50	0.50	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Guadalupe	0.25	0.00	0.25	0.2500	0.2500	0.0000	0.0000	0.0000	0.0000	0.0000
Comal	0.25	0.25	0.00	0.0000	0.2500	0.0000	0.2500	0.0000	0.0000	0.0000
Caldwell	0.00	0.25	0.00	0.0000	0.2500	0.2500	0.0000	0.2500	0.0000	0.0000
Hays	0.00	0.20	0.20	0.2000	0.0000	0.0000	0.2000	0.2000	0.0000	0.0000
Bastrop	0.00	0.00	0.00	0.3333	0.0000	0.0000	0.0000	0.3333	0.3333	0.0000
Blanco	0.00	0.00	0.25	0.0000	0.2500	0.0000	0.0000	0.2500	0.0000	0.2500
Travis	0.00	0.00	0.00	0.1667	0.1667	0.1667	0.1667	0.0000	0.1667	0.1667
Williamson	0.00	0.00	0.00	0.0000	0.0000	0.3333	0.0000	0.3333	0.0000	0.3333
Burnet	0.00	0.00	0.00	0.0000	0.0000	0.0000	0.3333	0.3333	0.3333	0.0000
Note:
Each row sums to 1.0. w_ij = W_ij / sum_j(W_ij). Used as input for Moran’s I computation.

3.6.3 Neighbor Connectivity Summary

n_neighbors <- card(nb_queen)
connect_df  <- data.frame(
  County     = study_sf$CNTY_NM,
  N_Neighbors = n_neighbors,
  Neighbors  = sapply(seq_along(nb_queen), function(i) {
    paste(study_sf$CNTY_NM[nb_queen[[i]]], collapse = ", ")
  })
)

connect_df %>%
  kbl(
    caption = "Table 5: Queen Contiguity Neighbor Relationships (10-County Study Area)",
    col.names = c("County", "No. of Neighbors", "Contiguous Counties")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 12
  ) %>%
  column_spec(3, italic = TRUE)

Table 5: Queen Contiguity Neighbor Relationships (10-County Study Area)
County	No. of Neighbors	Contiguous Counties
Bexar	2	Guadalupe, Comal
Guadalupe	4	Bexar, Comal, Caldwell, Hays
Comal	4	Bexar, Guadalupe, Hays, Blanco
Caldwell	4	Guadalupe, Hays, Bastrop, Travis
Hays	5	Guadalupe, Comal, Caldwell, Blanco, Travis
Bastrop	3	Caldwell, Travis, Williamson
Blanco	4	Comal, Hays, Travis, Burnet
Travis	6	Caldwell, Hays, Bastrop, Blanco, Williamson, Burnet
Williamson	3	Bastrop, Travis, Burnet
Burnet	3	Blanco, Travis, Williamson

3.7 Step 2: Moran’s I Computation

Moran’s I is a global measure of spatial autocorrelation defined as:

\[I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\]

where \(x_i\) is the observed value at location \(i\), \(\bar{x}\) is the global mean, \(w_{ij}\) is the spatial weight, and \(n\) is the number of observations. Under the null hypothesis of no spatial autocorrelation (complete spatial randomness):

\[E[I] = \frac{-1}{n-1}\]

Statistical significance is assessed via a permutation test (999 random permutations) and a normal approximation (randomization assumption).

# ---- Moran's I via permutation test (most robust for small n) -------------
set.seed(2025)
moran_perm <- moran.mc(
  x          = study_sf$AgeAdjstRatio,
  listw      = lw_queen,
  nsim       = 999,
  zero.policy = FALSE
)

# ---- Moran's I via analytical normal approximation -----------------------
moran_norm <- moran.test(
  x           = study_sf$AgeAdjstRatio,
  listw       = lw_queen,
  randomisation = TRUE,
  zero.policy  = FALSE
)

cat("=== Moran's I Results: Permutation Test (999 simulations) ===\n")

## === Moran's I Results: Permutation Test (999 simulations) ===

print(moran_perm)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  study_sf$AgeAdjstRatio 
## weights: lw_queen  
## number of simulations + 1: 1000 
## 
## statistic = 0.20746, observed rank = 941, p-value = 0.059
## alternative hypothesis: greater

cat("\n=== Moran's I Results: Normal Approximation ===\n")

## 
## === Moran's I Results: Normal Approximation ===

print(moran_norm)

## 
##  Moran I test under randomisation
## 
## data:  study_sf$AgeAdjstRatio  
## weights: lw_queen    
## 
## Moran I statistic standard deviate = 1.7593, p-value = 0.03927
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.20745842       -0.11111111        0.03279018

3.7.1 Compiled Moran’s I Results Table

n   <- nrow(study_sf)
E_I <- -1 / (n - 1)

morans_results <- data.frame(
  Parameter = c(
    "Observed Moran's I",
    "Expected I (H0: CSR)",
    "Variance (Randomization)",
    "Standard Deviate (Z)",
    "p-value (Permutation, 999 sim)",
    "p-value (Normal Approx.)",
    "Number of Counties (n)",
    "Total Spatial Links (W_ij sum)"
  ),
  Value = c(
    round(moran_perm$statistic, 6),
    round(E_I, 6),
    round(moran_norm$estimate[["Variance"]], 6),
    round(as.numeric(moran_norm$statistic), 4),
    round(moran_perm$p.value, 4),
    round(moran_norm$p.value, 4),
    n,
    sum(W_binary)
  )
)

morans_results %>%
  kbl(
    caption = "Table 6: Moran's I Results for AgeAdjstRatio (10-County Study Area)",
    col.names = c("Parameter", "Value"),
    align = c("l", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  row_spec(1, bold = TRUE, color = "white", background = "#2980b9") %>%
  row_spec(5:6, bold = TRUE,
           background = ifelse(moran_perm$p.value < 0.05, "#d5f5e3", "#fdebd0")) %>%
  footnote(general = paste0(
    "CSR = Complete Spatial Randomness. ",
    "Permutation test based on 999 random permutations (seed = 2025). ",
    "p < 0.05 indicates rejection of H0 at the 5% significance level."
  ))

Table 6: Moran’s I Results for AgeAdjstRatio (10-County Study Area)
Parameter	Value
Observed Moran’s I	0.207458
Expected I (H0: CSR)	-0.111111
Variance (Randomization)	0.032790
Standard Deviate (Z)	1.759300
p-value (Permutation, 999 sim)	0.059000
p-value (Normal Approx.)	0.039300
Number of Counties (n)	10.000000
Total Spatial Links (W_ij sum)	38.000000
Note:
CSR = Complete Spatial Randomness. Permutation test based on 999 random permutations (seed = 2025). p < 0.05 indicates rejection of H0 at the 5% significance level.

4 Results

4.1 Figure 1: Study Area Map with AgeAdjstRatio

# ---- Centroids for labeling -----------------------------------------------
centroids <- st_centroid(study_sf)
cent_coords <- st_coordinates(centroids)
study_sf$cx <- cent_coords[,1]
study_sf$cy <- cent_coords[,2]

# ---- Choropleth map -------------------------------------------------------
p_map <- ggplot(study_sf) +
  geom_sf(aes(fill = AgeAdjstRatio), color = "white", linewidth = 0.6) +
  geom_text(aes(x = cx, y = cy, label = CNTY_NM),
            size = 2.8, fontface = "bold", color = "black") +
  scale_fill_viridis_c(
    name   = "Age-Adj\nRatio\n(per 100k)",
    option = "YlOrRd",
    direction = 1,
    breaks = pretty(study_sf$AgeAdjstRatio, n = 5),
    labels = scales::label_number(accuracy = 0.1)
  ) +
  labs(
    title    = "10-County Central Texas Study Area",
    subtitle = "Age-Adjusted Late-Stage Colon Cancer Incidence Ratio (2005-2014)",
    caption  = "Source: GEO7301 Lab Dataset | CRS: EPSG 3082 (Texas Centric Albers)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption  = element_text(size = 8, color = "#888888"),
    legend.position = "right",
    axis.text     = element_text(size = 8),
    panel.grid    = element_line(color = "grey90")
  )

print(p_map)

Figure 1: 10-County Central Texas Study Area - Age-Adjusted Late-Stage Colon Cancer Incidence Ratio (2005-2014)

ggsave(file.path(out_dir, "Fig1_StudyAreaMap.png"),
       p_map, width = 10, height = 7, dpi = 150)

4.2 Figure 2: Spatial Connectivity Graph

# ---- Build connectivity plot data ----------------------------------------
nb_coords <- st_coordinates(st_centroid(study_sf))
nb_lines  <- nb2lines(nb_queen, coords = nb_coords, as_sf = TRUE)
nb_lines  <- st_set_crs(nb_lines, st_crs(study_sf))

p_conn <- ggplot() +
  geom_sf(data = study_sf, fill = "#d6eaf8", color = "#2980b9", linewidth = 0.8) +
  geom_sf(data = nb_lines, color = "#e74c3c", linewidth = 1.0, alpha = 0.7) +
  geom_text(data = study_sf,
            aes(x = cx, y = cy, label = CNTY_NM),
            size = 3.0, fontface = "bold", color = "#2c3e50") +
  labs(
    title    = "Queen Contiguity Connectivity Graph",
    subtitle = "Red lines connect queen-contiguous county pairs",
    caption  = "Source: County.shp | CRS: EPSG 3082"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption  = element_text(size = 8, color = "#888888"),
    panel.grid    = element_line(color = "grey92")
  )

print(p_conn)

Figure 2: Queen Contiguity Neighbor Graph for the 10-County Study Area

ggsave(file.path(out_dir, "Fig2_ConnectivityGraph.png"),
       p_conn, width = 10, height = 7, dpi = 150)

4.3 Figure 3: Moran’s I Scatter Plot (Moran Plot)

# ---- Moran scatter plot data ---------------------------------------------
z_score <- scale(study_sf$AgeAdjstRatio)[,1]
lag_z   <- lag.listw(lw_queen, z_score)

moran_scatter_df <- data.frame(
  County    = study_sf$CNTY_NM,
  z_AgeAdj  = z_score,
  lag_AgeAdj = lag_z,
  AgeAdjRaw = study_sf$AgeAdjstRatio
)

# ---- Quadrant labels ------------------------------------------------------
moran_scatter_df <- moran_scatter_df %>%
  mutate(
    Quadrant = case_when(
      z_AgeAdj >= 0 & lag_AgeAdj >= 0 ~ "HH (High-High)",
      z_AgeAdj <  0 & lag_AgeAdj <  0 ~ "LL (Low-Low)",
      z_AgeAdj >= 0 & lag_AgeAdj <  0 ~ "HL (High-Low)",
      z_AgeAdj <  0 & lag_AgeAdj >= 0 ~ "LH (Low-High)"
    )
  )

slope_val <- moran_perm$statistic

p_moran <- ggplot(moran_scatter_df, aes(x = z_AgeAdj, y = lag_AgeAdj)) +
  # Reference lines at origin
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.6) +
  # Regression line (slope = Moran's I)
  geom_smooth(method = "lm", se = TRUE, color = "#e74c3c",
              linewidth = 1.2, fill = "#fadbd8", alpha = 0.4) +
  # Points colored by quadrant
  geom_point(aes(color = Quadrant), size = 4, alpha = 0.85) +
  # County labels
  geom_text(aes(label = County), vjust = -0.8, size = 2.7,
            fontface = "bold", color = "#2c3e50") +
  scale_color_manual(
    values = c(
      "HH (High-High)" = "#c0392b",
      "LL (Low-Low)"   = "#2980b9",
      "HL (High-Low)"  = "#e67e22",
      "LH (Low-High)"  = "#27ae60"
    )
  ) +
  labs(
    title   = "Moran Scatter Plot: AgeAdjstRatio",
    subtitle = paste0("Moran's I = ", round(slope_val, 4),
                      " | p-value (perm.) = ", round(moran_perm$p.value, 4)),
    x       = "Standardized AgeAdjstRatio (z-score)",
    y       = "Spatial Lag of Standardized AgeAdjstRatio",
    color   = "Spatial Quadrant",
    caption = "Regression slope = Moran's I. Quadrants: HH=High-High, LL=Low-Low, HL=High-Low, LH=Low-High"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    legend.position = "bottom",
    panel.grid    = element_line(color = "grey92"),
    panel.border  = element_rect(color = "grey80", fill = NA)
  )

print(p_moran)

Figure 3: Moran Scatter Plot for AgeAdjstRatio - Spatial Lag vs. Standardized Variable

ggsave(file.path(out_dir, "Fig3_MoranScatterPlot.png"),
       p_moran, width = 10, height = 7, dpi = 150)

4.4 Figure 4: Permutation Distribution of Moran’s I

perm_df <- data.frame(I_perm = moran_perm$res[1:999])

obs_I <- moran_perm$statistic

p_perm <- ggplot(perm_df, aes(x = I_perm)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "#aed6f1",
                 color = "white", alpha = 0.85) +
  geom_density(color = "#2980b9", linewidth = 1.0) +
  geom_vline(xintercept = obs_I, color = "#c0392b",
             linewidth = 1.4, linetype = "solid") +
  geom_vline(xintercept = E_I, color = "#27ae60",
             linewidth = 1.0, linetype = "dashed") +
  annotate("text", x = obs_I + 0.01, y = Inf, vjust = 1.5,
           label = paste0("Observed I\n= ", round(obs_I, 4)),
           color = "#c0392b", fontface = "bold", size = 3.5) +
  annotate("text", x = E_I - 0.01, y = Inf, vjust = 3.0,
           label = paste0("E[I] = ", round(E_I, 4)),
           color = "#27ae60", size = 3.2, hjust = 1) +
  labs(
    title    = "Permutation Distribution of Moran's I (999 Simulations)",
    subtitle = paste0("Observed I = ", round(obs_I, 4),
                      "  |  p-value = ", round(moran_perm$p.value, 4)),
    x        = "Moran's I (Simulated)",
    y        = "Density",
    caption  = "Red line = Observed Moran's I; Green dashed = Expected I under H0"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    panel.grid    = element_line(color = "grey92"),
    panel.border  = element_rect(color = "grey80", fill = NA)
  )

print(p_perm)

Figure 4: Permutation Distribution of Moran’s I (999 simulations) with Observed Statistic

ggsave(file.path(out_dir, "Fig4_PermutationDistribution.png"),
       p_perm, width = 10, height = 6, dpi = 150)

4.5 Figure 5: Bar Chart of AgeAdjstRatio by County

bar_df <- study_sf %>%
  st_drop_geometry() %>%
  arrange(AgeAdjstRatio) %>%
  mutate(CNTY_NM = factor(CNTY_NM, levels = CNTY_NM),
         color_grp = ifelse(AgeAdjstRatio > mean(AgeAdjstRatio), "Above Mean", "Below Mean"))

p_bar <- ggplot(bar_df, aes(x = CNTY_NM, y = AgeAdjstRatio, fill = color_grp)) +
  geom_col(color = "white", width = 0.7, alpha = 0.85) +
  geom_hline(yintercept = mean(bar_df$AgeAdjstRatio),
             linetype = "dashed", color = "#c0392b", linewidth = 1.0) +
  annotate("text",
           x    = 0.5,
           y    = mean(bar_df$AgeAdjstRatio) + 0.3,
           label = paste0("Mean = ", round(mean(bar_df$AgeAdjstRatio), 2)),
           color = "#c0392b", size = 3.5, hjust = 0) +
  geom_text(aes(label = round(AgeAdjstRatio, 1)), hjust = -0.1,
            size = 3.5, fontface = "bold", color = "#2c3e50") +
  scale_fill_manual(values = c("Above Mean" = "#c0392b", "Below Mean" = "#2980b9")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  coord_flip() +
  labs(
    title    = "Age-Adjusted Late-Stage Colon Cancer Incidence Ratio by County",
    subtitle = "10-County Central Texas Study Area (2005-2014)",
    x        = NULL,
    y        = "Age-Adjusted Ratio (per 100,000)",
    fill     = "Relative to Mean",
    caption  = "Red dashed line = Study area mean"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 13, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    legend.position = "bottom",
    panel.grid.major.y = element_blank(),
    axis.text.y   = element_text(face = "bold")
  )

print(p_bar)

Figure 5: Age-Adjusted Late-Stage Colon Cancer Incidence Ratio by County (ranked)

ggsave(file.path(out_dir, "Fig5_BarChart_AgeAdjRatio.png"),
       p_bar, width = 10, height = 6, dpi = 150)

4.6 Moran Scatter Plot Quadrant Classification

moran_scatter_df %>%
  select(County, AgeAdjRaw, z_AgeAdj, lag_AgeAdj, Quadrant) %>%
  arrange(Quadrant, desc(z_AgeAdj)) %>%
  kbl(
    caption   = "Table 7: Moran Scatter Plot Quadrant Classification of Study Counties",
    col.names = c("County", "AgeAdjRatio (raw)", "z-score", "Spatial Lag (z)", "LISA Quadrant"),
    digits    = 4,
    align     = c("l","r","r","r","l")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 12
  ) %>%
  footnote(general = paste0(
    "HH = High value surrounded by high values (positive clustering). ",
    "LL = Low value surrounded by low values (positive clustering). ",
    "HL = High value surrounded by low values (negative/dispersion). ",
    "LH = Low value surrounded by high values (negative/dispersion)."
  ))

Table 7: Moran Scatter Plot Quadrant Classification of Study Counties
County	AgeAdjRatio (raw)	z-score	Spatial Lag (z)	LISA Quadrant
Caldwell	23.5	1.8599	0.1013	HH (High-High)
Guadalupe	22.3	1.2204	0.4610	HH (High-High)
Bexar	21.1	0.5809	0.3677	HH (High-High)
Bastrop	20.4	0.2078	-0.0053	HL (High-Low)
Hays	19.8	-0.1119	0.0799	LH (Low-High)
Comal	19.1	-0.4850	0.1013	LH (Low-High)
Burnet	19.8	-0.1119	-1.0534	LL (Low-Low)
Travis	18.3	-0.9113	-0.0675	LL (Low-Low)
Williamson	18.2	-0.9646	-0.2718	LL (Low-Low)
Blanco	17.6	-1.2844	-0.4050	LL (Low-Low)
Note:
HH = High value surrounded by high values (positive clustering). LL = Low value surrounded by low values (positive clustering). HL = High value surrounded by low values (negative/dispersion). LH = Low value surrounded by high values (negative/dispersion).

5 Conclusion and Discussion

5.1 Spatial Weights Matrix Discussion

The queen contiguity spatial weights matrix \(W_{ij}\) was successfully constructed for the 10-county Central Texas study area. Key findings from the weights matrix include:

## Average number of neighbors: 3.8

## Range of neighbors: 2 to 6

## Most connected county: Travis ( 6 neighbors )

## Least connected county: Bexar ( 2 neighbor(s) )

The average number of queen-contiguous neighbors across the 10 counties is 3.8, indicating a moderately well-connected planar graph for this compact study area.
Travis has the highest neighbor count (6 neighbors), reflecting its central geographic position within the study area.
Bexar has the fewest contiguous neighbors (2), suggesting a more peripheral location within the study area boundary.
The choice of queen contiguity is appropriate for county-level health data where both shared edges and shared vertices represent meaningful spatial interaction pathways (e.g., patient travel patterns, shared healthcare facilities).
Row standardization ensures each county contributes equally to the spatial lag computation, which is the standard convention for computing Moran’s I.

5.2 Moran’s I Discussion

The observed Moran’s I = 0.2075 is positive relative to the expected value under complete spatial randomness (E[I] = -0.1111). The p-value from the permutation test (999 simulations) is 0.059, indicating that the result is not statistically significant at the alpha = 0.05 level.

5.2.1 Interpretation

Substantive interpretation: An I value of 0.2075 (compared to E[I] = -0.1111) indicates positive spatial autocorrelation, meaning counties with similar (either both high or both low) age-adjusted colon cancer late-stage incidence ratios tend to be located near each other. This spatial clustering could reflect shared healthcare access barriers, similar socioeconomic profiles, or shared exposure to risk factors within contiguous counties.

5.2.2 Moran’s I in Context

Scale effects: With only \(n = 10\) counties, the statistical power of Moran’s I is inherently limited. Small sample sizes increase the variance of the statistic, making it harder to reject the null hypothesis even when true spatial structure exists. This is a known limitation of applying global Moran’s I to small study areas.
Permutation vs. normal approximation: The permutation-based p-value is more reliable for small \(n\), as it does not assume normality of the underlying data. Both methods are reported for completeness.
LISA decomposition: The Moran scatter plot (Figure 3) reveals the quadrant assignments of individual counties. Counties in the HH quadrant (high-high) contribute most strongly to positive autocorrelation, while HL/LH counties represent spatial outliers that dampen the global statistic.
Health geography context: Spatial clustering in late-stage colon cancer incidence often reflects disparities in screening rates, healthcare access, poverty, and racial composition. Counties in the LL quadrant (low-low cluster: Travis, Hays, Williamson, Blanco) align with the well-resourced Austin Metro corridor where healthcare infrastructure is more accessible.
Policy implications: If spatial clustering is confirmed at finer geographic scales or across broader regions, spatially targeted interventions (e.g., mobile screening units, county-level Medicaid expansion outreach) would be more efficient than uniform statewide policies.

6 References

Anselin, L. (1995). Local indicators of spatial association – LISA. Geographical Analysis, 27(2), 93-115.

Anselin, L. (2019). A Local Indicator of Multivariate Spatial Association: Extending Geary’s C. Geographical Analysis, 51(2), 133-150.

Bivand, R. S., Pebesma, E., and Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R, 2nd edition. Springer.

Getis, A. and Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24(3), 189-206.

Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17-23.

Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439-446.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

## === Session Information ===

## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.3.0       gridExtra_2.3      viridis_0.6.5      viridisLite_0.4.2 
##  [5] RColorBrewer_1.1-3 kableExtra_1.4.0   knitr_1.50         ggplot2_3.5.1     
##  [9] dplyr_1.1.4        readxl_1.4.5       spdep_1.3-10       spData_2.3.4      
## [13] sf_1.0-16         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1   farver_2.1.2       fastmap_1.2.0      TH.data_1.1-3     
##  [5] digest_0.6.35      lifecycle_1.0.4    LearnBayes_2.15.1  survival_3.8-3    
##  [9] magrittr_2.0.3     compiler_4.3.3     rlang_1.1.3        sass_0.4.9        
## [13] tools_4.3.3        igraph_2.1.4       yaml_2.3.10        labeling_0.4.3    
## [17] sp_2.2-0           classInt_0.4-11    xml2_1.3.8         multcomp_1.4-28   
## [21] KernSmooth_2.23-26 withr_3.0.2        grid_4.3.3         e1071_1.7-16      
## [25] colorspace_2.1-1   MASS_7.3-60.0.1    cli_3.6.2          mvtnorm_1.3-3     
## [29] rmarkdown_2.29     ragg_1.3.3         generics_0.1.3     rstudioapi_0.17.1 
## [33] DBI_1.2.3          cachem_1.1.0       proxy_0.4-27       stringr_1.5.1     
## [37] splines_4.3.3      spatialreg_1.3-6   s2_1.1.7           cellranger_1.1.0  
## [41] vctrs_0.6.5        boot_1.3-31        Matrix_1.6-5       sandwich_3.1-1    
## [45] jsonlite_2.0.0     systemfonts_1.2.1  jquerylib_0.1.4    units_0.8-7       
## [49] glue_1.8.0         codetools_0.2-20   stringi_1.8.7      gtable_0.3.6      
## [53] deldir_2.0-4       munsell_0.5.1      tibble_3.2.1       pillar_1.10.1     
## [57] htmltools_0.5.8.1  R6_2.6.1           wk_0.9.4           textshaping_1.0.0 
## [61] evaluate_1.0.3     lattice_0.22-6     bslib_0.9.0        class_7.3-23      
## [65] Rcpp_1.0.12        svglite_2.1.3      coda_0.19-4.1      nlme_3.1-168      
## [69] mgcv_1.9-1         xfun_0.51          zoo_1.8-13         pkgconfig_2.0.3

## All outputs saved to: C:/GEOPLATFORM/GEO7301_ADQUAN/R_Work/Auto Correlation

## Files saved:

##  - Fig1_StudyAreaMap.png
##  - Fig2_ConnectivityGraph.png
##  - Fig3_MoranScatterPlot.png
##  - Fig4_PermutationDistribution.png
##  - Fig5_BarChart_AgeAdjRatio.png
##  - Moran_Scatter_Quadrants.csv
##  - MoransI_Results_Table.csv
##  - Neighbor_Connectivity_Table.csv
##  - StudyArea_AttributeTable.csv
##  - W_Binary_Matrix.csv
##  - W_RowStd_Matrix.csv

Report generated using R 4.3.3 | spdep 1.3.10 | sf 1.0.16

Output directory: C:/GEOPLATFORM/GEO7301_ADQUAN/R_Work/Auto Correlation

Advanced Quantitative Methods in Geography

Spatial Weights Matrix and Moran’s I Analysis

Onyedikachi Joshua Okeke

April 13, 2026