1 Summary

This lab report documents a spatial autocorrelation analysis conducted on the Age-Adjusted Colon Cancer Late-Stage Incidence Ratio (AgeAdjstRatio) for a 10-county Central Texas study area. The analysis addresses two core tasks: (1) construction and display of a spatial weights matrix \(W_{ij}\) based on queen contiguity adjacency, and (2) computation and significance testing of Moran’s I statistic.

The 10-county study area encompasses the Austin-San Marcos metropolitan corridor and surrounding counties: Travis, Hays, Williamson, Bastrop, Caldwell, Comal, Guadalupe, Bexar, Blanco, and Burnet. This cluster represents a contiguous geographic unit appropriate for spatial dependence analysis. Results indicate the presence (or absence) of spatial clustering in late-stage colon cancer incidence across this region, with implications for health geography research and policy targeting.


2 Description of Data Used

2.1 Spatial Data

The spatial data consist of a polygon shapefile (County.shp) representing all 254 Texas counties. Key attribute fields used in this analysis include:

Field Description
CNTY_NM County name
CNTY_FIPS County FIPS code (character)
geometry Polygon geometry (EPSG:4326)

The shapefile is sourced from a Texas county boundary dataset and was reprojected to EPSG:3082 (Texas Centric Albers Equal Area) for accurate adjacency calculation.

2.2 Tabular Data

The tabular dataset (GEO7301_Labs3to5_CountyVariablesColonLateStage2005to2014.xlsx) contains 245 Texas counties with 29 variables including:

Variable Description
NAME County name
FIPS 5-digit FIPS code (integer)
AgeAdjstRatio Age-adjusted late-stage colon cancer incidence ratio (per 100,000)
PopAtRisk Population at risk
POPULATION Total county population
Various socioeconomic Race/ethnicity, housing, family size, etc.

The variable of primary interest is AgeAdjstRatio, representing age-adjusted late-stage colon cancer incidence for the period 2005-2014. This variable is appropriate for spatial autocorrelation analysis given its health-geographic interpretive value and potential for spatial clustering driven by shared environmental, demographic, or healthcare access factors.


3 Methods and Analysis Procedures

3.1 R Packages

# Install packages if not already installed
pkgs <- c("sf", "spdep", "readxl", "dplyr", "ggplot2",
          "knitr", "kableExtra", "RColorBrewer", "viridis",
          "gridExtra", "scales")

for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}

library(sf)
library(spdep)
library(readxl)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(RColorBrewer)
library(viridis)
library(gridExtra)
library(scales)

cat("All packages loaded successfully.\n")
## All packages loaded successfully.
cat("R version:", as.character(getRversion()), "\n")
## R version: 4.3.3
cat("sf version:", as.character(packageVersion("sf")), "\n")
## sf version: 1.0.16
cat("spdep version:", as.character(packageVersion("spdep")), "\n")
## spdep version: 1.3.10

3.2 Data Loading

# ---- Paths ----------------------------------------------------------------
shp_path  <- file.path(out_dir, "County.shp")
xlsx_path <- file.path(out_dir,
  "GEO7301_Labs3to5_CountyVariablesColonLateStage2005to2014.xlsx")

# Copy source files to output directory if not already there
if (!file.exists(shp_path)) {
  # Adjust these source paths to match your local setup
  src_shp   <- list.files(".", pattern = "County\\.shp$", recursive = TRUE,
                           full.names = TRUE)[1]
  src_xlsx  <- list.files(".", pattern = "\\.xlsx$", recursive = TRUE,
                           full.names = TRUE)[1]
  shp_files <- list.files(dirname(src_shp),
                           pattern = "County\\.(shp|dbf|prj|shx|cpg)$",
                           full.names = TRUE)
  file.copy(shp_files, out_dir, overwrite = TRUE)
  file.copy(src_xlsx, out_dir, overwrite = TRUE)
}

# ---- Load shapefile -------------------------------------------------------
tx_counties <- st_read(shp_path, quiet = TRUE)
cat("Shapefile CRS:", st_crs(tx_counties)$epsg, "\n")
## Shapefile CRS: 4326
cat("Number of counties in shapefile:", nrow(tx_counties), "\n")
## Number of counties in shapefile: 254
# ---- Load Excel table -----------------------------------------------------
county_data <- read_excel(xlsx_path)
cat("Rows in Excel table:", nrow(county_data), "\n")
## Rows in Excel table: 245
cat("Variables:", ncol(county_data), "\n")
## Variables: 29

3.3 Defining the 10-County Study Area

The study area is defined as 10 contiguous Central Texas counties forming the Austin-San Marcos corridor and its periphery. This selection reflects geographic contiguity and regional relevance for public health spatial analysis.

# ---- Define the 10-county study area ------------------------------------
study_counties <- c(
  "Travis",     # Austin core
  "Hays",       # San Marcos / Wimberley
  "Williamson", # Round Rock / Georgetown
  "Bastrop",    # Bastrop
  "Caldwell",   # Lockhart
  "Comal",      # New Braunfels
  "Guadalupe",  # Seguin
  "Bexar",      # San Antonio
  "Blanco",     # Johnson City
  "Burnet"      # Burnet
)

cat("Study counties:\n")
## Study counties:
cat(paste(" -", study_counties, collapse = "\n"), "\n")
##  - Travis
##  - Hays
##  - Williamson
##  - Bastrop
##  - Caldwell
##  - Comal
##  - Guadalupe
##  - Bexar
##  - Blanco
##  - Burnet

3.4 Data Preparation and Joining

# ---- Reproject to Texas Centric Albers Equal Area (EPSG:3082) -----------
tx_counties_proj <- st_transform(tx_counties, crs = 3082)

# ---- Subset to study area ------------------------------------------------
study_shp <- tx_counties_proj %>%
  filter(CNTY_NM %in% study_counties)

cat("Counties retained in spatial subset:", nrow(study_shp), "\n")
## Counties retained in spatial subset: 10
# ---- Prepare FIPS key for join -------------------------------------------
# Shapefile FIPS is character (e.g., "48453"); Excel FIPS is integer (48453)
study_shp <- study_shp %>%
  mutate(FIPS_key = as.integer(CNTY_FIPS))

# ---- Join tabular data to spatial layer ----------------------------------
study_data <- county_data %>%
  filter(NAME %in% study_counties) %>%
  select(NAME, FIPS, AgeAdjstRatio, POPULATION, PopAtRisk,
         POP_SQMI, BLACK, HISPANIC, RENTER_OCC, VACANT)

study_sf <- study_shp %>%
  left_join(study_data, by = c("FIPS_key" = "FIPS"))

cat("\nJoined dataset dimensions:", nrow(study_sf), "rows x",
    ncol(study_sf), "cols\n")
## 
## Joined dataset dimensions: 10 rows x 26 cols
cat("NA in AgeAdjstRatio:", sum(is.na(study_sf$AgeAdjstRatio)), "\n")
## NA in AgeAdjstRatio: 0

3.5 Summary Statistics for Study Area

summary_df <- study_sf %>%
  st_drop_geometry() %>%
  select(NAME, AgeAdjstRatio, POPULATION, PopAtRisk, POP_SQMI) %>%
  arrange(desc(AgeAdjstRatio))

summary_df %>%
  kbl(
    caption = "Table 1: Descriptive Summary of Study Area Counties",
    digits  = 2,
    col.names = c("County", "Age-Adj Ratio", "Population", "Pop at Risk",
                  "Pop/sq mi"),
    align   = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 12
  ) %>%
  column_spec(2, bold = TRUE, color = "white",
              background = spec_color(summary_df$AgeAdjstRatio,
                                      palette = "YlOrRd",
                                      end = 0.85)) %>%
  add_header_above(c(" " = 1, "Health Variable" = 1, "Demographic Context" = 3)) %>%
  footnote(general = "AgeAdjstRatio = Age-adjusted late-stage colon cancer incidence ratio per 100,000 population, 2005-2014.")
Table 1: Descriptive Summary of Study Area Counties
Health Variable
Demographic Context
County Age-Adj Ratio Population Pop at Risk Pop/sq mi
Caldwell 23.5 41606 379646 76.1
Guadalupe 22.3 152994 1285404 214.0
Bexar 21.1 1897710 17008499 1511.0
Bastrop 20.4 82757 733819 92.4
Hays 19.8 199764 1550533 294.0
Burnet 19.8 47418 426042 46.4
Comal 19.1 131083 1074227 227.9
Travis 18.3 1166555 10221115 1140.0
Williamson 18.2 506367 4150032 446.3
Blanco 17.6 11228 102567 15.7
Note:
AgeAdjstRatio = Age-adjusted late-stage colon cancer incidence ratio per 100,000 population, 2005-2014.
cat("=== Descriptive Statistics: AgeAdjstRatio ===\n")
## === Descriptive Statistics: AgeAdjstRatio ===
x <- study_sf$AgeAdjstRatio
stats_df <- data.frame(
  Statistic = c("N", "Min", "Max", "Mean", "Median", "SD", "CV (%)"),
  Value     = c(
    length(x),
    round(min(x), 2),
    round(max(x), 2),
    round(mean(x), 2),
    round(median(x), 2),
    round(sd(x), 3),
    round(sd(x)/mean(x)*100, 2)
  )
)

stats_df %>%
  kbl(caption = "Table 2: Descriptive Statistics for AgeAdjstRatio",
      align = c("l","r")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, position = "center")
Table 2: Descriptive Statistics for AgeAdjstRatio
Statistic Value
N 10.000
Min 17.600
Max 23.500
Mean 20.010
Median 19.800
SD 1.876
CV (%) 9.380

3.6 Step 1: Spatial Weights Matrix Construction

The spatial weights matrix \(W_{ij}\) is constructed using queen contiguity (shared boundary or vertex). For \(n = 10\) counties, the binary adjacency matrix \(W\) is defined as:

\[W_{ij} = \begin{cases} 1 & \text{if counties } i \text{ and } j \text{ share a boundary or vertex} \\ 0 & \text{otherwise} \end{cases}\]

Row standardization transforms \(W\) so that each row sums to 1, yielding the row-stochastic matrix \(W^{(RS)}\), which is the standard form used in Moran’s I computation:

\[w_{ij}^{(RS)} = \frac{W_{ij}}{\sum_j W_{ij}}\]

# ---- Build queen contiguity neighbors list --------------------------------
nb_queen <- poly2nb(study_sf, queen = TRUE)

cat("=== Neighbor Summary (Queen Contiguity) ===\n")
## === Neighbor Summary (Queen Contiguity) ===
print(summary(nb_queen))
## Neighbour list object:
## Number of regions: 10 
## Number of nonzero links: 38 
## Percentage nonzero weights: 38 
## Average number of links: 3.8 
## Link number distribution:
## 
## 2 3 4 5 6 
## 1 3 4 1 1 
## 1 least connected region:
## 1 with 2 links
## 1 most connected region:
## 8 with 6 links
# ---- Assign region names for readability ----------------------------------
attr(nb_queen, "region.id") <- study_sf$CNTY_NM

# ---- Row-standardized spatial weights object ------------------------------
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = FALSE)

cat("\n=== Spatial Weights List Summary ===\n")
## 
## === Spatial Weights List Summary ===
print(summary(lw_queen))
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 10 
## Number of nonzero links: 38 
## Percentage nonzero weights: 38 
## Average number of links: 3.8 
## Link number distribution:
## 
## 2 3 4 5 6 
## 1 3 4 1 1 
## 1 least connected region:
## Bexar with 2 links
## 1 most connected region:
## Travis with 6 links
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0       S1       S2
## W 10 100 10 5.486111 41.00556

3.6.1 Display the Full Binary Adjacency Matrix

# ---- Convert to full binary matrix ----------------------------------------
W_binary <- nb2mat(nb_queen, style = "B", zero.policy = FALSE)
rownames(W_binary) <- study_sf$CNTY_NM
colnames(W_binary) <- study_sf$CNTY_NM

cat("=== Binary Spatial Weights Matrix W_ij (Queen Contiguity) ===\n\n")
## === Binary Spatial Weights Matrix W_ij (Queen Contiguity) ===
W_display <- as.data.frame(W_binary)
print(W_display)
##            Bexar Guadalupe Comal Caldwell Hays Bastrop Blanco Travis Williamson
## Bexar          0         1     1        0    0       0      0      0          0
## Guadalupe      1         0     1        1    1       0      0      0          0
## Comal          1         1     0        0    1       0      1      0          0
## Caldwell       0         1     0        0    1       1      0      1          0
## Hays           0         1     1        1    0       0      1      1          0
## Bastrop        0         0     0        1    0       0      0      1          1
## Blanco         0         0     1        0    1       0      0      1          0
## Travis         0         0     0        1    1       1      1      0          1
## Williamson     0         0     0        0    0       1      0      1          0
## Burnet         0         0     0        0    0       0      1      1          1
##            Burnet
## Bexar           0
## Guadalupe       0
## Comal           0
## Caldwell        0
## Hays            0
## Bastrop         0
## Blanco          1
## Travis          1
## Williamson      1
## Burnet          0
W_binary %>%
  as.data.frame() %>%
  kbl(
    caption = "Table 3: Binary Spatial Weights Matrix W_ij (Queen Contiguity, n = 10 Counties)",
    digits  = 0
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width        = TRUE,
    font_size         = 11
  ) %>%
  column_spec(1:11, width = "1.6cm") %>%
  footnote(general = paste0(
    "W_ij = 1 if counties i and j are queen-contiguous; 0 otherwise. ",
    "Diagonal entries are 0 by convention. ",
    "Column names abbreviated to county names."
  ))
Table 3: Binary Spatial Weights Matrix W_ij (Queen Contiguity, n = 10 Counties)
Bexar Guadalupe Comal Caldwell Hays Bastrop Blanco Travis Williamson Burnet
Bexar 0 1 1 0 0 0 0 0 0 0
Guadalupe 1 0 1 1 1 0 0 0 0 0
Comal 1 1 0 0 1 0 1 0 0 0
Caldwell 0 1 0 0 1 1 0 1 0 0
Hays 0 1 1 1 0 0 1 1 0 0
Bastrop 0 0 0 1 0 0 0 1 1 0
Blanco 0 0 1 0 1 0 0 1 0 1
Travis 0 0 0 1 1 1 1 0 1 1
Williamson 0 0 0 0 0 1 0 1 0 1
Burnet 0 0 0 0 0 0 1 1 1 0
Note:
W_ij = 1 if counties i and j are queen-contiguous; 0 otherwise. Diagonal entries are 0 by convention. Column names abbreviated to county names.

3.6.2 Display the Row-Standardized Weights Matrix

# ---- Row-standardized matrix ----------------------------------------------
W_rowstd <- nb2mat(nb_queen, style = "W", zero.policy = FALSE)
rownames(W_rowstd) <- study_sf$CNTY_NM
colnames(W_rowstd) <- study_sf$CNTY_NM

cat("=== Row-Standardized Spatial Weights Matrix W_ij^(RS) ===\n\n")
## === Row-Standardized Spatial Weights Matrix W_ij^(RS) ===
print(round(W_rowstd, 4))
##            Bexar Guadalupe Comal Caldwell   Hays Bastrop Blanco Travis
## Bexar       0.00      0.50  0.50   0.0000 0.0000  0.0000 0.0000 0.0000
## Guadalupe   0.25      0.00  0.25   0.2500 0.2500  0.0000 0.0000 0.0000
## Comal       0.25      0.25  0.00   0.0000 0.2500  0.0000 0.2500 0.0000
## Caldwell    0.00      0.25  0.00   0.0000 0.2500  0.2500 0.0000 0.2500
## Hays        0.00      0.20  0.20   0.2000 0.0000  0.0000 0.2000 0.2000
## Bastrop     0.00      0.00  0.00   0.3333 0.0000  0.0000 0.0000 0.3333
## Blanco      0.00      0.00  0.25   0.0000 0.2500  0.0000 0.0000 0.2500
## Travis      0.00      0.00  0.00   0.1667 0.1667  0.1667 0.1667 0.0000
## Williamson  0.00      0.00  0.00   0.0000 0.0000  0.3333 0.0000 0.3333
## Burnet      0.00      0.00  0.00   0.0000 0.0000  0.0000 0.3333 0.3333
##            Williamson Burnet
## Bexar          0.0000 0.0000
## Guadalupe      0.0000 0.0000
## Comal          0.0000 0.0000
## Caldwell       0.0000 0.0000
## Hays           0.0000 0.0000
## Bastrop        0.3333 0.0000
## Blanco         0.0000 0.2500
## Travis         0.1667 0.1667
## Williamson     0.0000 0.3333
## Burnet         0.3333 0.0000
## attr(,"call")
## nb2mat(neighbours = nb_queen, style = "W", zero.policy = FALSE)
round(W_rowstd, 4) %>%
  as.data.frame() %>%
  kbl(
    caption = "Table 4: Row-Standardized Spatial Weights Matrix W_ij (style = 'W')",
    digits  = 4
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 11
  ) %>%
  footnote(general = paste0(
    "Each row sums to 1.0. ",
    "w_ij = W_ij / sum_j(W_ij). ",
    "Used as input for Moran's I computation."
  ))
Table 4: Row-Standardized Spatial Weights Matrix W_ij (style = ‘W’)
Bexar Guadalupe Comal Caldwell Hays Bastrop Blanco Travis Williamson Burnet
Bexar 0.00 0.50 0.50 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Guadalupe 0.25 0.00 0.25 0.2500 0.2500 0.0000 0.0000 0.0000 0.0000 0.0000
Comal 0.25 0.25 0.00 0.0000 0.2500 0.0000 0.2500 0.0000 0.0000 0.0000
Caldwell 0.00 0.25 0.00 0.0000 0.2500 0.2500 0.0000 0.2500 0.0000 0.0000
Hays 0.00 0.20 0.20 0.2000 0.0000 0.0000 0.2000 0.2000 0.0000 0.0000
Bastrop 0.00 0.00 0.00 0.3333 0.0000 0.0000 0.0000 0.3333 0.3333 0.0000
Blanco 0.00 0.00 0.25 0.0000 0.2500 0.0000 0.0000 0.2500 0.0000 0.2500
Travis 0.00 0.00 0.00 0.1667 0.1667 0.1667 0.1667 0.0000 0.1667 0.1667
Williamson 0.00 0.00 0.00 0.0000 0.0000 0.3333 0.0000 0.3333 0.0000 0.3333
Burnet 0.00 0.00 0.00 0.0000 0.0000 0.0000 0.3333 0.3333 0.3333 0.0000
Note:
Each row sums to 1.0. w_ij = W_ij / sum_j(W_ij). Used as input for Moran’s I computation.

3.6.3 Neighbor Connectivity Summary

n_neighbors <- card(nb_queen)
connect_df  <- data.frame(
  County     = study_sf$CNTY_NM,
  N_Neighbors = n_neighbors,
  Neighbors  = sapply(seq_along(nb_queen), function(i) {
    paste(study_sf$CNTY_NM[nb_queen[[i]]], collapse = ", ")
  })
)

connect_df %>%
  kbl(
    caption = "Table 5: Queen Contiguity Neighbor Relationships (10-County Study Area)",
    col.names = c("County", "No. of Neighbors", "Contiguous Counties")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 12
  ) %>%
  column_spec(3, italic = TRUE)
Table 5: Queen Contiguity Neighbor Relationships (10-County Study Area)
County No. of Neighbors Contiguous Counties
Bexar 2 Guadalupe, Comal
Guadalupe 4 Bexar, Comal, Caldwell, Hays
Comal 4 Bexar, Guadalupe, Hays, Blanco
Caldwell 4 Guadalupe, Hays, Bastrop, Travis
Hays 5 Guadalupe, Comal, Caldwell, Blanco, Travis
Bastrop 3 Caldwell, Travis, Williamson
Blanco 4 Comal, Hays, Travis, Burnet
Travis 6 Caldwell, Hays, Bastrop, Blanco, Williamson, Burnet
Williamson 3 Bastrop, Travis, Burnet
Burnet 3 Blanco, Travis, Williamson

3.7 Step 2: Moran’s I Computation

Moran’s I is a global measure of spatial autocorrelation defined as:

\[I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\]

where \(x_i\) is the observed value at location \(i\), \(\bar{x}\) is the global mean, \(w_{ij}\) is the spatial weight, and \(n\) is the number of observations. Under the null hypothesis of no spatial autocorrelation (complete spatial randomness):

\[E[I] = \frac{-1}{n-1}\]

Statistical significance is assessed via a permutation test (999 random permutations) and a normal approximation (randomization assumption).

# ---- Moran's I via permutation test (most robust for small n) -------------
set.seed(2025)
moran_perm <- moran.mc(
  x          = study_sf$AgeAdjstRatio,
  listw      = lw_queen,
  nsim       = 999,
  zero.policy = FALSE
)

# ---- Moran's I via analytical normal approximation -----------------------
moran_norm <- moran.test(
  x           = study_sf$AgeAdjstRatio,
  listw       = lw_queen,
  randomisation = TRUE,
  zero.policy  = FALSE
)

cat("=== Moran's I Results: Permutation Test (999 simulations) ===\n")
## === Moran's I Results: Permutation Test (999 simulations) ===
print(moran_perm)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  study_sf$AgeAdjstRatio 
## weights: lw_queen  
## number of simulations + 1: 1000 
## 
## statistic = 0.20746, observed rank = 941, p-value = 0.059
## alternative hypothesis: greater
cat("\n=== Moran's I Results: Normal Approximation ===\n")
## 
## === Moran's I Results: Normal Approximation ===
print(moran_norm)
## 
##  Moran I test under randomisation
## 
## data:  study_sf$AgeAdjstRatio  
## weights: lw_queen    
## 
## Moran I statistic standard deviate = 1.7593, p-value = 0.03927
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.20745842       -0.11111111        0.03279018

3.7.1 Compiled Moran’s I Results Table

n   <- nrow(study_sf)
E_I <- -1 / (n - 1)

morans_results <- data.frame(
  Parameter = c(
    "Observed Moran's I",
    "Expected I (H0: CSR)",
    "Variance (Randomization)",
    "Standard Deviate (Z)",
    "p-value (Permutation, 999 sim)",
    "p-value (Normal Approx.)",
    "Number of Counties (n)",
    "Total Spatial Links (W_ij sum)"
  ),
  Value = c(
    round(moran_perm$statistic, 6),
    round(E_I, 6),
    round(moran_norm$estimate[["Variance"]], 6),
    round(as.numeric(moran_norm$statistic), 4),
    round(moran_perm$p.value, 4),
    round(moran_norm$p.value, 4),
    n,
    sum(W_binary)
  )
)

morans_results %>%
  kbl(
    caption = "Table 6: Moran's I Results for AgeAdjstRatio (10-County Study Area)",
    col.names = c("Parameter", "Value"),
    align = c("l", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 13
  ) %>%
  row_spec(1, bold = TRUE, color = "white", background = "#2980b9") %>%
  row_spec(5:6, bold = TRUE,
           background = ifelse(moran_perm$p.value < 0.05, "#d5f5e3", "#fdebd0")) %>%
  footnote(general = paste0(
    "CSR = Complete Spatial Randomness. ",
    "Permutation test based on 999 random permutations (seed = 2025). ",
    "p < 0.05 indicates rejection of H0 at the 5% significance level."
  ))
Table 6: Moran’s I Results for AgeAdjstRatio (10-County Study Area)
Parameter Value
Observed Moran’s I 0.207458
Expected I (H0: CSR) -0.111111
Variance (Randomization) 0.032790
Standard Deviate (Z) 1.759300
p-value (Permutation, 999 sim) 0.059000
p-value (Normal Approx.) 0.039300
Number of Counties (n) 10.000000
Total Spatial Links (W_ij sum) 38.000000
Note:
CSR = Complete Spatial Randomness. Permutation test based on 999 random permutations (seed = 2025). p < 0.05 indicates rejection of H0 at the 5% significance level.

4 Results

4.1 Figure 1: Study Area Map with AgeAdjstRatio

# ---- Centroids for labeling -----------------------------------------------
centroids <- st_centroid(study_sf)
cent_coords <- st_coordinates(centroids)
study_sf$cx <- cent_coords[,1]
study_sf$cy <- cent_coords[,2]

# ---- Choropleth map -------------------------------------------------------
p_map <- ggplot(study_sf) +
  geom_sf(aes(fill = AgeAdjstRatio), color = "white", linewidth = 0.6) +
  geom_text(aes(x = cx, y = cy, label = CNTY_NM),
            size = 2.8, fontface = "bold", color = "black") +
  scale_fill_viridis_c(
    name   = "Age-Adj\nRatio\n(per 100k)",
    option = "YlOrRd",
    direction = 1,
    breaks = pretty(study_sf$AgeAdjstRatio, n = 5),
    labels = scales::label_number(accuracy = 0.1)
  ) +
  labs(
    title    = "10-County Central Texas Study Area",
    subtitle = "Age-Adjusted Late-Stage Colon Cancer Incidence Ratio (2005-2014)",
    caption  = "Source: GEO7301 Lab Dataset | CRS: EPSG 3082 (Texas Centric Albers)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption  = element_text(size = 8, color = "#888888"),
    legend.position = "right",
    axis.text     = element_text(size = 8),
    panel.grid    = element_line(color = "grey90")
  )

print(p_map)
Figure 1: 10-County Central Texas Study Area - Age-Adjusted Late-Stage Colon Cancer Incidence Ratio (2005-2014)

Figure 1: 10-County Central Texas Study Area - Age-Adjusted Late-Stage Colon Cancer Incidence Ratio (2005-2014)

ggsave(file.path(out_dir, "Fig1_StudyAreaMap.png"),
       p_map, width = 10, height = 7, dpi = 150)

4.2 Figure 2: Spatial Connectivity Graph

# ---- Build connectivity plot data ----------------------------------------
nb_coords <- st_coordinates(st_centroid(study_sf))
nb_lines  <- nb2lines(nb_queen, coords = nb_coords, as_sf = TRUE)
nb_lines  <- st_set_crs(nb_lines, st_crs(study_sf))

p_conn <- ggplot() +
  geom_sf(data = study_sf, fill = "#d6eaf8", color = "#2980b9", linewidth = 0.8) +
  geom_sf(data = nb_lines, color = "#e74c3c", linewidth = 1.0, alpha = 0.7) +
  geom_text(data = study_sf,
            aes(x = cx, y = cy, label = CNTY_NM),
            size = 3.0, fontface = "bold", color = "#2c3e50") +
  labs(
    title    = "Queen Contiguity Connectivity Graph",
    subtitle = "Red lines connect queen-contiguous county pairs",
    caption  = "Source: County.shp | CRS: EPSG 3082"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption  = element_text(size = 8, color = "#888888"),
    panel.grid    = element_line(color = "grey92")
  )

print(p_conn)
Figure 2: Queen Contiguity Neighbor Graph for the 10-County Study Area

Figure 2: Queen Contiguity Neighbor Graph for the 10-County Study Area

ggsave(file.path(out_dir, "Fig2_ConnectivityGraph.png"),
       p_conn, width = 10, height = 7, dpi = 150)

4.3 Figure 3: Moran’s I Scatter Plot (Moran Plot)

# ---- Moran scatter plot data ---------------------------------------------
z_score <- scale(study_sf$AgeAdjstRatio)[,1]
lag_z   <- lag.listw(lw_queen, z_score)

moran_scatter_df <- data.frame(
  County    = study_sf$CNTY_NM,
  z_AgeAdj  = z_score,
  lag_AgeAdj = lag_z,
  AgeAdjRaw = study_sf$AgeAdjstRatio
)

# ---- Quadrant labels ------------------------------------------------------
moran_scatter_df <- moran_scatter_df %>%
  mutate(
    Quadrant = case_when(
      z_AgeAdj >= 0 & lag_AgeAdj >= 0 ~ "HH (High-High)",
      z_AgeAdj <  0 & lag_AgeAdj <  0 ~ "LL (Low-Low)",
      z_AgeAdj >= 0 & lag_AgeAdj <  0 ~ "HL (High-Low)",
      z_AgeAdj <  0 & lag_AgeAdj >= 0 ~ "LH (Low-High)"
    )
  )

slope_val <- moran_perm$statistic

p_moran <- ggplot(moran_scatter_df, aes(x = z_AgeAdj, y = lag_AgeAdj)) +
  # Reference lines at origin
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50", linewidth = 0.6) +
  # Regression line (slope = Moran's I)
  geom_smooth(method = "lm", se = TRUE, color = "#e74c3c",
              linewidth = 1.2, fill = "#fadbd8", alpha = 0.4) +
  # Points colored by quadrant
  geom_point(aes(color = Quadrant), size = 4, alpha = 0.85) +
  # County labels
  geom_text(aes(label = County), vjust = -0.8, size = 2.7,
            fontface = "bold", color = "#2c3e50") +
  scale_color_manual(
    values = c(
      "HH (High-High)" = "#c0392b",
      "LL (Low-Low)"   = "#2980b9",
      "HL (High-Low)"  = "#e67e22",
      "LH (Low-High)"  = "#27ae60"
    )
  ) +
  labs(
    title   = "Moran Scatter Plot: AgeAdjstRatio",
    subtitle = paste0("Moran's I = ", round(slope_val, 4),
                      " | p-value (perm.) = ", round(moran_perm$p.value, 4)),
    x       = "Standardized AgeAdjstRatio (z-score)",
    y       = "Spatial Lag of Standardized AgeAdjstRatio",
    color   = "Spatial Quadrant",
    caption = "Regression slope = Moran's I. Quadrants: HH=High-High, LL=Low-Low, HL=High-Low, LH=Low-High"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    legend.position = "bottom",
    panel.grid    = element_line(color = "grey92"),
    panel.border  = element_rect(color = "grey80", fill = NA)
  )

print(p_moran)
Figure 3: Moran Scatter Plot for AgeAdjstRatio - Spatial Lag vs. Standardized Variable

Figure 3: Moran Scatter Plot for AgeAdjstRatio - Spatial Lag vs. Standardized Variable

ggsave(file.path(out_dir, "Fig3_MoranScatterPlot.png"),
       p_moran, width = 10, height = 7, dpi = 150)

4.4 Figure 4: Permutation Distribution of Moran’s I

perm_df <- data.frame(I_perm = moran_perm$res[1:999])

obs_I <- moran_perm$statistic

p_perm <- ggplot(perm_df, aes(x = I_perm)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "#aed6f1",
                 color = "white", alpha = 0.85) +
  geom_density(color = "#2980b9", linewidth = 1.0) +
  geom_vline(xintercept = obs_I, color = "#c0392b",
             linewidth = 1.4, linetype = "solid") +
  geom_vline(xintercept = E_I, color = "#27ae60",
             linewidth = 1.0, linetype = "dashed") +
  annotate("text", x = obs_I + 0.01, y = Inf, vjust = 1.5,
           label = paste0("Observed I\n= ", round(obs_I, 4)),
           color = "#c0392b", fontface = "bold", size = 3.5) +
  annotate("text", x = E_I - 0.01, y = Inf, vjust = 3.0,
           label = paste0("E[I] = ", round(E_I, 4)),
           color = "#27ae60", size = 3.2, hjust = 1) +
  labs(
    title    = "Permutation Distribution of Moran's I (999 Simulations)",
    subtitle = paste0("Observed I = ", round(obs_I, 4),
                      "  |  p-value = ", round(moran_perm$p.value, 4)),
    x        = "Moran's I (Simulated)",
    y        = "Density",
    caption  = "Red line = Observed Moran's I; Green dashed = Expected I under H0"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    panel.grid    = element_line(color = "grey92"),
    panel.border  = element_rect(color = "grey80", fill = NA)
  )

print(p_perm)
Figure 4: Permutation Distribution of Moran's I (999 simulations) with Observed Statistic

Figure 4: Permutation Distribution of Moran’s I (999 simulations) with Observed Statistic

ggsave(file.path(out_dir, "Fig4_PermutationDistribution.png"),
       p_perm, width = 10, height = 6, dpi = 150)

4.5 Figure 5: Bar Chart of AgeAdjstRatio by County

bar_df <- study_sf %>%
  st_drop_geometry() %>%
  arrange(AgeAdjstRatio) %>%
  mutate(CNTY_NM = factor(CNTY_NM, levels = CNTY_NM),
         color_grp = ifelse(AgeAdjstRatio > mean(AgeAdjstRatio), "Above Mean", "Below Mean"))

p_bar <- ggplot(bar_df, aes(x = CNTY_NM, y = AgeAdjstRatio, fill = color_grp)) +
  geom_col(color = "white", width = 0.7, alpha = 0.85) +
  geom_hline(yintercept = mean(bar_df$AgeAdjstRatio),
             linetype = "dashed", color = "#c0392b", linewidth = 1.0) +
  annotate("text",
           x    = 0.5,
           y    = mean(bar_df$AgeAdjstRatio) + 0.3,
           label = paste0("Mean = ", round(mean(bar_df$AgeAdjstRatio), 2)),
           color = "#c0392b", size = 3.5, hjust = 0) +
  geom_text(aes(label = round(AgeAdjstRatio, 1)), hjust = -0.1,
            size = 3.5, fontface = "bold", color = "#2c3e50") +
  scale_fill_manual(values = c("Above Mean" = "#c0392b", "Below Mean" = "#2980b9")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  coord_flip() +
  labs(
    title    = "Age-Adjusted Late-Stage Colon Cancer Incidence Ratio by County",
    subtitle = "10-County Central Texas Study Area (2005-2014)",
    x        = NULL,
    y        = "Age-Adjusted Ratio (per 100,000)",
    fill     = "Relative to Mean",
    caption  = "Red dashed line = Study area mean"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 13, color = "#2c3e50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    legend.position = "bottom",
    panel.grid.major.y = element_blank(),
    axis.text.y   = element_text(face = "bold")
  )

print(p_bar)
Figure 5: Age-Adjusted Late-Stage Colon Cancer Incidence Ratio by County (ranked)

Figure 5: Age-Adjusted Late-Stage Colon Cancer Incidence Ratio by County (ranked)

ggsave(file.path(out_dir, "Fig5_BarChart_AgeAdjRatio.png"),
       p_bar, width = 10, height = 6, dpi = 150)

4.6 Moran Scatter Plot Quadrant Classification

moran_scatter_df %>%
  select(County, AgeAdjRaw, z_AgeAdj, lag_AgeAdj, Quadrant) %>%
  arrange(Quadrant, desc(z_AgeAdj)) %>%
  kbl(
    caption   = "Table 7: Moran Scatter Plot Quadrant Classification of Study Counties",
    col.names = c("County", "AgeAdjRatio (raw)", "z-score", "Spatial Lag (z)", "LISA Quadrant"),
    digits    = 4,
    align     = c("l","r","r","r","l")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = FALSE,
    position          = "center",
    font_size         = 12
  ) %>%
  footnote(general = paste0(
    "HH = High value surrounded by high values (positive clustering). ",
    "LL = Low value surrounded by low values (positive clustering). ",
    "HL = High value surrounded by low values (negative/dispersion). ",
    "LH = Low value surrounded by high values (negative/dispersion)."
  ))
Table 7: Moran Scatter Plot Quadrant Classification of Study Counties
County AgeAdjRatio (raw) z-score Spatial Lag (z) LISA Quadrant
Caldwell 23.5 1.8599 0.1013 HH (High-High)
Guadalupe 22.3 1.2204 0.4610 HH (High-High)
Bexar 21.1 0.5809 0.3677 HH (High-High)
Bastrop 20.4 0.2078 -0.0053 HL (High-Low)
Hays 19.8 -0.1119 0.0799 LH (Low-High)
Comal 19.1 -0.4850 0.1013 LH (Low-High)
Burnet 19.8 -0.1119 -1.0534 LL (Low-Low)
Travis 18.3 -0.9113 -0.0675 LL (Low-Low)
Williamson 18.2 -0.9646 -0.2718 LL (Low-Low)
Blanco 17.6 -1.2844 -0.4050 LL (Low-Low)
Note:
HH = High value surrounded by high values (positive clustering). LL = Low value surrounded by low values (positive clustering). HL = High value surrounded by low values (negative/dispersion). LH = Low value surrounded by high values (negative/dispersion).

5 Conclusion and Discussion

5.1 Spatial Weights Matrix Discussion

The queen contiguity spatial weights matrix \(W_{ij}\) was successfully constructed for the 10-county Central Texas study area. Key findings from the weights matrix include:

## Average number of neighbors: 3.8
## Range of neighbors: 2 to 6
## Most connected county: Travis ( 6 neighbors )
## Least connected county: Bexar ( 2 neighbor(s) )
  • The average number of queen-contiguous neighbors across the 10 counties is 3.8, indicating a moderately well-connected planar graph for this compact study area.

  • Travis has the highest neighbor count (6 neighbors), reflecting its central geographic position within the study area.

  • Bexar has the fewest contiguous neighbors (2), suggesting a more peripheral location within the study area boundary.

  • The choice of queen contiguity is appropriate for county-level health data where both shared edges and shared vertices represent meaningful spatial interaction pathways (e.g., patient travel patterns, shared healthcare facilities).

  • Row standardization ensures each county contributes equally to the spatial lag computation, which is the standard convention for computing Moran’s I.

5.2 Moran’s I Discussion

The observed Moran’s I = 0.2075 is positive relative to the expected value under complete spatial randomness (E[I] = -0.1111). The p-value from the permutation test (999 simulations) is 0.059, indicating that the result is not statistically significant at the alpha = 0.05 level.

5.2.1 Interpretation

Substantive interpretation: An I value of 0.2075 (compared to E[I] = -0.1111) indicates positive spatial autocorrelation, meaning counties with similar (either both high or both low) age-adjusted colon cancer late-stage incidence ratios tend to be located near each other. This spatial clustering could reflect shared healthcare access barriers, similar socioeconomic profiles, or shared exposure to risk factors within contiguous counties.

5.2.2 Moran’s I in Context

  1. Scale effects: With only \(n = 10\) counties, the statistical power of Moran’s I is inherently limited. Small sample sizes increase the variance of the statistic, making it harder to reject the null hypothesis even when true spatial structure exists. This is a known limitation of applying global Moran’s I to small study areas.

  2. Permutation vs. normal approximation: The permutation-based p-value is more reliable for small \(n\), as it does not assume normality of the underlying data. Both methods are reported for completeness.

  3. LISA decomposition: The Moran scatter plot (Figure 3) reveals the quadrant assignments of individual counties. Counties in the HH quadrant (high-high) contribute most strongly to positive autocorrelation, while HL/LH counties represent spatial outliers that dampen the global statistic.

  4. Health geography context: Spatial clustering in late-stage colon cancer incidence often reflects disparities in screening rates, healthcare access, poverty, and racial composition. Counties in the LL quadrant (low-low cluster: Travis, Hays, Williamson, Blanco) align with the well-resourced Austin Metro corridor where healthcare infrastructure is more accessible.

  5. Policy implications: If spatial clustering is confirmed at finer geographic scales or across broader regions, spatially targeted interventions (e.g., mobile screening units, county-level Medicaid expansion outreach) would be more efficient than uniform statewide policies.


6 References

Anselin, L. (1995). Local indicators of spatial association – LISA. Geographical Analysis, 27(2), 93-115.

Anselin, L. (2019). A Local Indicator of Multivariate Spatial Association: Extending Geary’s C. Geographical Analysis, 51(2), 133-150.

Bivand, R. S., Pebesma, E., and Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R, 2nd edition. Springer.

Getis, A. and Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24(3), 189-206.

Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17-23.

Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439-446.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.


## === Session Information ===
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.3.0       gridExtra_2.3      viridis_0.6.5      viridisLite_0.4.2 
##  [5] RColorBrewer_1.1-3 kableExtra_1.4.0   knitr_1.50         ggplot2_3.5.1     
##  [9] dplyr_1.1.4        readxl_1.4.5       spdep_1.3-10       spData_2.3.4      
## [13] sf_1.0-16         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1   farver_2.1.2       fastmap_1.2.0      TH.data_1.1-3     
##  [5] digest_0.6.35      lifecycle_1.0.4    LearnBayes_2.15.1  survival_3.8-3    
##  [9] magrittr_2.0.3     compiler_4.3.3     rlang_1.1.3        sass_0.4.9        
## [13] tools_4.3.3        igraph_2.1.4       yaml_2.3.10        labeling_0.4.3    
## [17] sp_2.2-0           classInt_0.4-11    xml2_1.3.8         multcomp_1.4-28   
## [21] KernSmooth_2.23-26 withr_3.0.2        grid_4.3.3         e1071_1.7-16      
## [25] colorspace_2.1-1   MASS_7.3-60.0.1    cli_3.6.2          mvtnorm_1.3-3     
## [29] rmarkdown_2.29     ragg_1.3.3         generics_0.1.3     rstudioapi_0.17.1 
## [33] DBI_1.2.3          cachem_1.1.0       proxy_0.4-27       stringr_1.5.1     
## [37] splines_4.3.3      spatialreg_1.3-6   s2_1.1.7           cellranger_1.1.0  
## [41] vctrs_0.6.5        boot_1.3-31        Matrix_1.6-5       sandwich_3.1-1    
## [45] jsonlite_2.0.0     systemfonts_1.2.1  jquerylib_0.1.4    units_0.8-7       
## [49] glue_1.8.0         codetools_0.2-20   stringi_1.8.7      gtable_0.3.6      
## [53] deldir_2.0-4       munsell_0.5.1      tibble_3.2.1       pillar_1.10.1     
## [57] htmltools_0.5.8.1  R6_2.6.1           wk_0.9.4           textshaping_1.0.0 
## [61] evaluate_1.0.3     lattice_0.22-6     bslib_0.9.0        class_7.3-23      
## [65] Rcpp_1.0.12        svglite_2.1.3      coda_0.19-4.1      nlme_3.1-168      
## [69] mgcv_1.9-1         xfun_0.51          zoo_1.8-13         pkgconfig_2.0.3
## All outputs saved to: C:/GEOPLATFORM/GEO7301_ADQUAN/R_Work/Auto Correlation
## Files saved:
##  - Fig1_StudyAreaMap.png
##  - Fig2_ConnectivityGraph.png
##  - Fig3_MoranScatterPlot.png
##  - Fig4_PermutationDistribution.png
##  - Fig5_BarChart_AgeAdjRatio.png
##  - Moran_Scatter_Quadrants.csv
##  - MoransI_Results_Table.csv
##  - Neighbor_Connectivity_Table.csv
##  - StudyArea_AttributeTable.csv
##  - W_Binary_Matrix.csv
##  - W_RowStd_Matrix.csv

Report generated using R 4.3.3 | spdep 1.3.10 | sf 1.0.16

Output directory: C:/GEOPLATFORM/GEO7301_ADQUAN/R_Work/Auto Correlation