Final Project

Mammography Rates by County in North Carolina

Introduction

An individual’s socioeconomic status will often indicate their access (or lack there of) to preventative health measures. In this project, the relationship between the amount of mammograms per a given population to the median income of the given population. It is important to understand this relationship in order to promote policy changes in favor of those who lack access to preventative healthcare. Furthermore, this may encourage further research into healthcare access in general to North Carolinians.

Data Preparation

#Call on libraries
library(tidyverse)
library(sf)
library(tigris)
library(ggplot2)
library(tmap)
library(spdep)
library(knitr)
library(dplyr)
library(tidyr)

### Relative File Path
dat_path <- file.path("../Code", "analytic_data2023_0.csv")
### Read data 
dat <- read_csv(dat_path)
### Ensure all data used is in numeric form
dat$`Mammography Screening raw value` <- as.numeric(dat$`Mammography Screening raw value`)

### Rename columns
dat <- dat |>
  rename(mamm_raw = `Mammography Screening raw value`,
         pop = `Population raw value`,
         state = `State Abbreviation`,
         county = `Name`)
### Select columns that you need to use
dat<- dat |>
  select(mamm_raw,
         pop,
         state,
         county)
### Filter by state
dat <- dat |>
  filter(state == "NC")
### Mutate data to create equation for the percentage of the population that has gotten a mammography
dat <- dat |>
  mutate(per_pop_mamm = mamm_raw * 100)
### Select spatial data layer
nc_cnty <- read_sf("../Code/nc_county_bound.gpkg") 

## Table join spatial layer with main data layer

dat_nc_counties <- left_join(nc_cnty, dat[-1,], by = c("NAMELSAD" = "county"))

The main data for this project comes from the US County Health Rankings. The secondary, spatial data layer, comes from the tigris package’s census data for counties. After downloading the Health Rankings data, I decided that I would focus on the mammography screening raw value. To create an equation that exemplified their relationship, I also needed information on population, state, and county. The US County Health Ranking data table is huge so I had to narrow it down to only show data about NC. The final equation for the mammography-screening-rate to population can be easily broken down. It is the mammography screenings rate decimal in the county multiplied by 100. This value is the the percentage of the population which has received a mammography. This data was then joined with the tigris county data.

Exploratory Spatial Data Analysis

Data Description and Summary

dat_summary <- dat |> # create data summary
  summarize(
    mean_per_pop_mamm = mean(per_pop_mamm, na.rm = TRUE), #Calculate mean.
    min_per_pop_mamm = min(per_pop_mamm, na.rm = TRUE), #Calculate min.
    max_per_pop_mamm = max(per_pop_mamm, na.rm = TRUE), #Calculate max.
    na_count = sum(is.na(per_pop_mamm)), #Calculate NAs
    total_ob = n() #Calculate the total amount of observations
  )
dat |> ## Histogram
  ggplot(mapping = aes(x = per_pop_mamm)) + # Ggplot to create histogram
  geom_histogram(binwidth = 3.62,
                 fill = "forestgreen",
                 color = "black") + ### Customize histogram with green fill and black borders
  labs(title = "Percent of Population who Received Mammogram", # Large title
                 x = "Mammography to Population Ratio",
                 y = "Frequency") ### Customize title of x and y axis

There are 101 total observations in the data set. 0 are NAs. 41.3267327 is the average value for mammography screening rate. 25 is the minimum value, and 54 is the maximum value. Each value is a percentage of that county’s population.

Geographic Distribution and Spatial Clustering

dat_nc_counties <- st_sf(dat_nc_counties) #Convert data to sf
dat_nc_counties_tmap <- tm_shape(dat_nc_counties) + 
  tm_polygons("per_pop_mamm", # variable being mapped
              palette = "Reds", #color palette
              style = "jenks", #Style of the map
              title = "Mammography Rates per County", # Title of legend
              border.col = "Black", # Polygon line color
              border.alpha = 0.25, # Polygon line opacity
              legend.hist = TRUE) + # Show legend
  tm_layout(frame = FALSE, # No frame around map
            main.title = "North Carolina Women's Preventative Health Disparities", # Big title of the map
            main.title.size = 1.25, # Font size of title
            legend.outside = TRUE) # Show legend

### Subset data to ONLY observations with values
### Remove NAs prior to Moran's I and LISA analysis
dat_nc_counties <- dat_nc_counties |>
  filter(!is.na(per_pop_mamm))

## Create Queen case 
dnc_nb_queen <- poly2nb(dat_nc_counties, 
                       queen = TRUE)

## Print summary to screen
dnc_nb_queen

## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 490 
## Percentage nonzero weights: 4.9 
## Average number of links: 4.9

## Convert  object to weight matrix
dnc_wm_queen <-  nb2listw(dnc_nb_queen,
                         style = "B",         # For binary
                         zero.policy = TRUE)  # Allow obs with 0 


### Moran's I

dat_nc_counties_moran <- moran.test(dat_nc_counties$per_pop_mamm,  # Variable being analyzed
                           listw = dnc_wm_queen,       # Sp weights matrix
                           alternative = "two.sided", # Clustering or Dispersion
                           randomisation = TRUE,      # Compare to randomized values
                           zero.policy = TRUE)        # Allow obs with 0 

## Summary
dat_nc_counties_moran

## 
##  Moran I test under randomisation
## 
## data:  dat_nc_counties$per_pop_mamm  
## weights: dnc_wm_queen    
## 
## Moran I statistic standard deviate = 4.8024, p-value = 1.568e-06
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.286301377      -0.010101010       0.003809355

### LISA
#
dat_nc_counties_lisa <- localmoran(dat_nc_counties$per_pop_mamm, # The variable which I am analyzing, per_pop_mamm
                          listw = dnc_wm_queen,       # Weights object
                          alternative = "two.sided", # Clustering or Dispersion
                          zero.policy = TRUE) |>     # no policy
  as_tibble() |>                                     # object type
  mutate(across(everything(), as.vector))            # Remove uneccesary from localmoran output
  
# Values required for LISA category
dat_nc_counties_lisa <- dat_nc_counties_lisa |>
  mutate(SCVAR =  scale(dat_nc_counties$per_pop_mamm) |> as.vector(),           # Original data column
         LAGVAR = lag.listw(dnc_wm_queen, scale(dat_nc_counties$per_pop_mamm)),  # Lag of original data column
         LISACAT = case_when(SCVAR >= 0 & LAGVAR >= 0 & `Pr(z != E(Ii))` <= 0.05 ~ 1,
                             SCVAR <= 0 & LAGVAR <= 0 & `Pr(z != E(Ii))` <= 0.05 ~ 2,
                             SCVAR >= 0 & LAGVAR <= 0 & `Pr(z != E(Ii))` <= 0.05 ~ 3,
                             SCVAR <= 0 & LAGVAR >= 0 & `Pr(z != E(Ii))` <= 0.05 ~ 4,
                             TRUE ~ NA_real_))

# Label based on the values
dat_nc_counties_lisa <- dat_nc_counties_lisa |>
  mutate(CATNAME = case_when(LISACAT == 1 ~ "HH",
                             LISACAT == 2 ~ "LL",
                             LISACAT == 3 ~ "HL",
                             LISACAT == 4 ~ "LH",
                             LISACAT == 5 ~ "Not Significant"))

## Summary of LISA output
lisa_summary <- table(dat_nc_counties_lisa$CATNAME)


## Add LISA category column to the spatial data

dat_nc_counties <- dat_nc_counties |>
  mutate(LISACAT = dat_nc_counties_lisa$LISACAT,
         CATNAME = dat_nc_counties_lisa$CATNAME)

## LISA kable table
kable(
  as.data.frame(lisa_summary), 
      caption = "Summary of LISA Analysis Results",
      col.names = c("Category", "Frequency"),
  format = "html",
  table.attr = 'class="table table-striped" style="font-size: 14px;"',
  align = c("l", "r")
  )

Summary of LISA Analysis Results
Category	Frequency
HH	6
LL	5

### MAPS

# Second, the LISA map
lisa_tmap <- tm_shape(dat_nc_counties) + 
  tm_polygons("LISACAT", 
              title = "LISA Category", ## Title of the key
              breaks = c(1, 2, 3, 4, 5, 6), 
              palette =  c("red", ## Color scheme for the legend
                           "blue", 
                           "lightpink", 
                           "skyblue", 
                           "grey90"),
              colorNA = "white",
              labels = c("High-High", ## Labels for the legend
                         "Low-Low",
                         "High-Low",
                         "Low-High", 
                         "Not significant"),
              border.col = "black", ## Color of the polygon lines
              border.alpha = 0.25) + ## Transparency for polygon lines 
  tm_layout(frame = FALSE,          ## No frame 
            legend.outside = TRUE) ## Legend on the outside of the map

### Map them together
tmap_arrange(dat_nc_counties_tmap, 
             lisa_tmap,
             ncol = 1,
             nrow = 2,
             sync = TRUE)

Based on the choropleth map, coastal NC has higher mammography rates. Western-most NC and central NC have lower mammography rates. Areas which are more rural have low mammography rates. Inversely, the more metropolitan areas have higher mammography rates.

The Moran’s I statistic is 0.2863014. The p value is 1.5679576^{-6}. The Moran’s I suggests a positive spatial autocorrelation between counties with high mammography rates. The low p-value suggests that this is a pattern and not just a random occurrence.

The LISA analysis reveals the spatial distribution of mammography in each county in North Carolina.6 counties are high-high clusters and are mainly throughout the southeast. Low-low clusters have a count of 5, which is mostly distributed in the western tip of the state. The fact that there are no LH or HL clusters suggests that there are not significant outlier areas where they are bordered by counties with the opposite value. Areas with higher mammography rates on the choropleth map coincide with the HH clusters on the LISA model. The same is true for areas with lower mammography rates, which coincide with the LL clusters. Both HH and LL indicate that these values are clustered spatially.

Conclusions

After collecting, wrangling, and doing a spatial analysis on the mammography rates in North Carolina counties, there appears to be a very high statistical significance. Based on my computations, the p-value is very close to zero. Furthermore, based on the choropleth map and LISA map having a similar spatial pattern, it is safe to say the disparities in mammography rates are worth exploring. If I were to add another variable/spatial layer, it would be the median income in each county. I would hypothesize that low income areas would correlate with low mammography rates.

Document Statistics

Method	koRpus	stringi
Word count	562	568
Character count	3550	3550
Sentence count	38	Not available
Reading time	2.8 minutes	2.8 minutes