Let’s get into the project!

Basic Statistics & Visualizations

The first thing I did after obtaining my data sets, was cleaning my data, removing unnecessary columns, fixing column name formats so they can be implemented without causing issues, and other formalities like that.

cancer_counts_clean <- cancer_state_raw |> 
  clean_names() |>
  mutate(state = str_remove(states, " \\(\\d+\\)$")) |>
  filter(!is.na(state), state != "Total")

ggplot(cancer_counts_clean, aes(x = count)) +
  geom_histogram(fill = "pink", color = "white", bins = 15) +
  theme_minimal() +
  labs(
    title = "Distribution of Total Cancer Incidences Across States",
    x = "Number of Cancer Incidences",
    y = "Frequency (Number of States)",
    caption = "Source: CDC WONDER (Raw Counts)"
  )

Interesting, the data is heavily right skewed. Perhaps the cases that are towards the right with counts at 3,000,000 and 4,000,000 are high population states like California and Texas. This histogram alone isn’t a proper depiction of cancer incidences and states. The information provided by this graph is very general and not of much substance to my project just yet. But that’s okay, onwards and upwards!

Let’s try using my other data as well, maybe that will let us take a closer look. I used an inner_join() between the CDC cancer dataset and 2022 ACS population estimates to calculate the annual incidence rate per 100k people. This makes it so high population states do not disproportionately skew the data. One challenge I encountered was the discrepancy between the CDC state names and the tigris geography files. For instance, the CDC data includes “District of Columbia” which I had to filter accordingly to ensure the left_join() didn’t result in missing data on the final plot.I used the shift_geometry() function to include Alaska and Hawaii in a compact view as well.

state_pop <- get_acs(
  geography = "state",
  variables = "B01003_001", 
  year = 2022
) |>
  clean_names() |>
  select(name, estimate) |>
  rename(state = name, population = estimate)

geographical_data <- cancer_state_raw |> 
  clean_names() |>
  mutate(state = str_remove(states, " \\(\\d+\\)$")) |>
  filter(!is.na(state), state != "Total") |>
  inner_join(state_pop, by = "state") |>
  mutate(annual_rate_100k = (count / 24 / population) * 100000)

us_geo <- states(cb = TRUE, resolution = "20m") |>
  shift_geometry() |>
  clean_names()

map_ready <- us_geo |>
  left_join(geographical_data, by = c("name" = "state"))

ggplot(map_ready) +
  geom_sf(aes(fill = annual_rate_100k), color = "white", size = 0.2) +
  scale_fill_viridis_c(option = "plasma", name = "Rate per 100k") +
  theme_void() +
  labs(
    title = "The Geography of Cancer: Incidence Rates by State (2022)",
    caption = "Source: CDC WONDER & US Census API"
  )

As you can see in the graph, cancer incidents by state don’t seem to be uniform: there are evidently states with higher rates. Something I realized as I was observing the graph was that I, in fact, did not know where each US state was. My initial approach with this map didn’t exactly work, I know that one of the bright yellow territory is Maine, however, I am not sure what the the other yellow state is. So, I decided to use an interactive graph, so when I hover over a territory, I know exactly which state it is.

map_ready_leaflet <- map_ready |>
  st_transform(4326)

cancer_pal <- colorNumeric(
  palette = "plasma", 
  domain = map_ready_leaflet$annual_rate_100k
)

leaflet(map_ready_leaflet) |>
  addTiles() |>
  addPolygons(
    fillColor = ~cancer_pal(annual_rate_100k),
    color = "white",
    weight = 0.5,
    fillOpacity = 0.7,
    popup = ~name 
  ) |>
  addLegend(
    pal = cancer_pal, 
    values = ~annual_rate_100k, 
    title = "Annual Rate per 100k"
  )

Much better! Now I see clearly that the states with the higest annual rate per 100K are West Virginia and Maine.

Socioeconimics Nationally

We’ve established that there are some regions with higher cancer incidents than others. As I mentioned, rural hospitals have an evident lack of funding. Rural regions struggle with healthcare, job security, and other factors. Lets pull some of the economic data from CENSUS and see what kind of patterns/statistics we can extract from there. I want to explore if and how socioeconomic disparities exist alongside health disparities.

state_income <- get_acs(
  geography = "state",
  variables = "B19013_001",
  year = 2022
) |>
  clean_names() |>
  select(name, estimate) |>
  rename(state = name, median_income = estimate)

economics_data <- geographical_data |>
  inner_join(state_income, by = "state")

top_5_rich <- economics_data |>
  slice_max(median_income, n = 5) |>
  select(state, median_income, annual_rate_100k)

bottom_5_poor <- economics_data |>
  slice_min(median_income, n = 5) |>
  select(state, median_income, annual_rate_100k)

bind_rows(top_5_rich, bottom_5_poor)

This data displays the 5 richest, and 5 poorest states along with their cancer rates. I don’t see any evident patterns, District of Columbia, the richest area, has a higher cancer rate than New Mexico, the poorest. New Jersey, has a higher rate than Mississippi, Arkansas, Louisiana, and New Mexico. However, West Virginia, one of the poorer states has a very high rate. To better be able to establish whether wealth plays a role in cancer incidents, let’s graph the median income with annual rate so we can better visualize the relationship between the two.

ggplot(economics_data, aes(x = median_income, y = annual_rate_100k)) +
  geom_point(color = "pink", size = 3, alpha = 0.7) +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(
    title = "The Impact of Income on Cancer Incidence",
    x = "Median Household Income",
    y = "Average Annual Cancer Rate (per 100k)",
    caption = "Source: CDC WONDER & US Census Bureau (2022)"
  )

The interactive map in the previous section revealed that certain states have higher rates of cancer. When I integrated Census Bureau income data in Section 2, the resulting scatter plot revealed a complex relationship. I had expected expect wealth to directly correlate with lower cancer rates, the data shows quite a bit of variation.

A More Detailed Analysis

Perhaps we can go more in-depth. I’m going to prepare the data for comparison across income brackets so we can see more details and graph it in a different way (other than scatter plot as we used that previously)

bracket_data <- economics_data |>
  mutate(income_bracket = case_when(
    median_income < 65000 ~ "Low Income",
    median_income >= 65000 & median_income < 85000 ~ "Medium Income",
    median_income >= 85000 ~ "High Income"
  ))

comparison_long <- bracket_data |>
  select(state, income_bracket, annual_rate_100k) |>
  pivot_longer(
    cols = annual_rate_100k,
    names_to = "rate_type",
    values_to = "incidence_value"
  )

ggplot(comparison_long, aes(x = income_bracket, y = incidence_value, fill = income_bracket)) + scale_fill_manual(values = c("High Income" = "maroon", 
                               "Medium Income" = "red", 
                               "Low Income" = "lightpink")) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Cancer Incidence Variance by State Income Bracket",
    x = "Income Bracket",
    y = "Annual Rate per 100k"
  )

Well, we see that the median line for that the high income bracket is actually the lowest of the three, yet it displays the largest interquartile range, indicating a high level of health disparities within wealthy states. T “Low Income” bracket has a much tighter distribution but contains a significant outlier, probably West Virginia, suggesting that while most lower-income states hover around a similar rate, perhaps there are some regional factors that can cause extreme results. The “Medium Income” bracket actually maintains the highest median incidence rate across the other brackets. Perhaps cancer incident rate isn’t as greatly impacted by wealth as I had initially suspected.

Using Linear Regression to Continue Observing The Relationship

income_model <- lm(annual_rate_100k ~ median_income, data = economics_data)

ggplot(economics_data, aes(x = median_income, y = annual_rate_100k)) +
  geom_point(color = "pink", size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = "maroon", se = TRUE) + 
  theme_minimal() +
  labs(  
    title = "Visualizing the P-Value: Income vs. Cancer",
    x = "Median Household Income",
    y = "Average Annual Cancer Rate (per 100k)"
  )

This regression plot supports my doubt: cancer incident rate isn’t as greatly impacted by wealth as I had initially suspected. The points being dispersed across the graph and the ribbon of standard error’s wideness are indicators that median household income is not a significant predictor of cancer incidence. To truly confirm this statement, let’s observe the p-value. If it is less than 0.05, it is statistically significant. If p-value is greater than 0.05, it is not statistically significant. Based off the regression plot, the scatter plot, box plot, and table of 5 richest/poorest states with income, I predict that the p-value will be higher than 0.05.

model_results <- tidy(income_model)
income_p_value <- model_results |> 
  filter(term == "median_income") |> 
  pull(p.value)

income_p_value

[1] 0.2164253

Aha! The p-value of 0.2164253 is much higher than the 0.05, this means that there is no statistically significant relationship between a state’s median income and its cancer incidence rate.

Statement

My initial suspicion of a relationship between income and cancer incidence has been debunked! However, in my public policy class, the rural hospitals I was mentioning are found at a state/neighborhood scale, not on a national scale. State-wide averages like the ones discussed in this project can mask the high cancer incidences in rural areas. For example, our state of New Jersey was one of the top five wealthiest states, however we may have more localized areas with higher than average cancer rates and low income.

Diving Deeper into NJ’s Cancer Incidence Counts by County

Maybe we should continue exploring by county to really see if there is a relationship between cancer counts and wealth.

nj_cancer_clean <- read_excel("cancer_county.xlsx") |> 
  clean_names() |> 
  select(
    county = county_new_jersey_united_states,
    age_adjusted_rate = age_adjusted_rate_per_100_000_population,
    count
  ) |> 
  mutate(count = as.numeric(str_replace_all(count, ",", ""))) |> 
  filter(county != "New Jersey") |> 
  mutate(county_match = paste(county, "County, New Jersey"))

nj_income_geo <- get_acs(
  geography = "county",
  variables = "B19013_001", 
  state = "NJ",
  year = 2022,
  geometry = TRUE
) |> 
  clean_names() |> 
  select(name, median_income = estimate)

nj_final_data <- nj_income_geo |> 
  left_join(nj_cancer_clean, by = c("name" = "county_match"))

nj_map_ready <- nj_final_data |> 
  st_transform(4326) |> 
  mutate(age_adjusted_rate = as.numeric(age_adjusted_rate)) |> 
  filter(!is.na(age_adjusted_rate)) |> 
  filter(name != "United States" & name != "New Jersey")

nj_pal <- colorNumeric(
  palette = "plasma", 
  domain = nj_map_ready$age_adjusted_rate,
  na.color = "transparent"
)

leaflet(nj_map_ready) |>
  addTiles() |>
  addPolygons(
    fillColor = ~nj_pal(age_adjusted_rate), 
    color = "white",
    weight = 1,
    fillOpacity = 0.7,
    label = ~paste0(name, ": ", age_adjusted_rate),
    popup = ~paste0("<strong>", name, "</strong><br>",
                   "Incidence Rate: ", age_adjusted_rate, " per 100k")
  ) |>
  addLegend(
    pal = nj_pal, 
    values = ~age_adjusted_rate, 
    title = "Incidence per 100k",
    position = "bottomright"
  )

Here’s my graph for NJ counties that show the incidences (notice this time I jumped straight into an interactive map so I could see exactly which counties reflect what numbers). From this graph, I notice that Cape May has the highest count with Sussex, Gloucester, and Ocean County following closely behind. Now, lets go ahead and pull the income data so we can compare the incomes in these counties against the incidence rate.

Socioeconimics for NJ

tibble(
  county = c("Sussex", "Gloucester", "Ocean", "Cape May"),
  median_income = c(105747, 96601, 82379, 76237)
)

Alright, Cape May seems to have a lower median income, however Sussex as a relatively high income. Lets see where these fall with respect to the 5 richest and 5 poorest counties.

top_5_rich <- nj_final_data |>
  slice_max(order_by = median_income, n = 5) |>
  mutate(status = "Richest 5")

bottom_5_poor <- nj_final_data |>
  slice_min(order_by = median_income, n = 5) |>
  mutate(status = "Poorest 5")

nj_income_comparison <- bind_rows(top_5_rich, bottom_5_poor) |>
  as_tibble() |>
  st_drop_geometry() |>
  select(status, name, median_income, age_adjusted_rate) |>
  arrange(desc(median_income))

nj_income_comparison

Based off this table, we see that New Jersey’s cancer statistics are complex. Interestingly enough, Monmouth and Bergen County, two of the richest states, have higher cancer rates than some of the poorer counties. Essex County, has one of the lowest rates. Cape May doesn’t appear in this table, it has the highest cancer incidence rate yet doesn’t constitute as one of the 5 poor states.

Linear Regression

nj_model <- lm(age_adjusted_rate ~ median_income, data = nj_income_comparison)

p_value <- tidy(nj_model) |> 
  filter(term == "median_income") |> 
  pull(p.value)

p_value

[1] 0.4574148

Similar conclusion to the state-wise data, the p-value is large, evidently above 0.05, showing that the relationship is not statistically significant. In New Jersey, you can’t really draw a conclusion between cancer rates and wealth.

Conclusion

As discussed, state-wise there doesn’t seem to be a pattern indicating that poorer states are more susceptible to having cancer. Similarly, after analyzing a richer state’s (New Jersey’s) cancer incidence rate by county, we see that there still isn’t a clear relationship between wealth and cancer counts. I observed that often times the richer states/counties have higher incidence rates than their poorer counterparts. This may be due to those in richer states having better access to healthcare that allows them to seek treatment. Richer states/counties may simply have more recorded incidences than others.

This project was super fun for me to code! Each finding led me to go deeper into analysis, statistics, and data wrangling, I hope it was interesting to you too!

Appendix

Data Acquisition Note: This project utilizes real-time data from the US Census Bureau via the tidycensus API. To reproduce these results, a valid Census API key must be installed using census_api_key(). All demographic estimates are based on the 2022 ACS 5-year survey to ensure temporal alignment with the CDC cancer incidence data and the New Jersey State Health Assessment Data.

State Level Cancer Incidence and Socioeconomic Correlation Study

Ananya Almeida

Project Description

Data Sets