Let’s get into the project!

Basic Statistics & Visualizations

The first thing I did after obtaining my data sets was clean them: removing unnecessary columns, fixing column name formats so they can be used in code without causing issues, and other housekeeping like that.

library(tidyverse)  # dplyr, stringr, tidyr, ggplot2
library(janitor)    # clean_names()

cancer_counts_clean <- cancer_state_raw |> 
  clean_names() |>
  mutate(state = str_remove(states, " \\(\\d+\\)$")) |>
  filter(!is.na(state), state != "Total")
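To see what the str_remove() pattern is doing, here is a small sketch on made-up labels. It assumes the raw export writes each state as a name followed by a numeric code in parentheses, which is what the regular expression above implies; the three sample strings are hypothetical and not taken from the CDC file.

```r
library(stringr)

# Hypothetical raw labels in the "Name (code)" shape the pattern expects
raw_labels <- c("Alabama (01)", "District of Columbia (11)", "Total")

# " \\(\\d+\\)$" matches a space, "(", one or more digits, and ")" at the end of the string
clean_labels <- str_remove(raw_labels, " \\(\\d+\\)$")
clean_labels
```

Labels without a trailing code, like "Total", pass through unchanged, which is why the separate filter() step is still needed to drop that row.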

ggplot(cancer_counts_clean, aes(x = count)) +
  geom_histogram(fill = "pink", color = "white", bins = 15) +
  theme_minimal() +
  labs(
    title = "Distribution of Total Cancer Incidences Across States",
    x = "Number of Cancer Incidences",
    y = "Frequency (Number of States)",
    caption = "Source: CDC WONDER (Raw Counts)"
  )

Interesting, the data is heavily right-skewed. Perhaps the states far to the right, with counts around 3,000,000 and 4,000,000, are high-population states like California and Texas. This histogram alone isn’t a proper depiction of how cancer incidence varies across states, because raw counts mostly reflect how many people live in each state. The information it provides is very general and not of much substance to my project just yet. But that’s okay, onwards and upwards!
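One quick way to back up the visual impression of skew is to compare the mean and the median: in a right-skewed distribution, a few very large values pull the mean above the median. Here is a minimal sketch with made-up counts (not the CDC figures), plus a log-scaled x-axis, which is one common option for spreading out the crowded low end of a histogram like this.

```r
library(ggplot2)

# Made-up counts: most values are small, two are very large
toy_counts <- data.frame(
  count = c(40, 55, 60, 75, 90, 120, 150, 900, 1400) * 1000
)

mean_count <- mean(toy_counts$count)
median_count <- median(toy_counts$count)
mean_count > median_count  # TRUE is consistent with a right skew

# Same histogram idea, drawn on a log10 x-axis
skew_plot <- ggplot(toy_counts, aes(x = count)) +
  geom_histogram(fill = "pink", color = "white", bins = 8) +
  scale_x_log10(labels = scales::comma) +
  theme_minimal()
skew_plot
```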

Let’s try using my other data as well; maybe that will let us take a closer look. I used an inner_join() between the CDC cancer dataset and the 2022 ACS population estimates to calculate the annual incidence rate per 100k people. This keeps high-population states from disproportionately skewing the data. One challenge I encountered was the discrepancy between the CDC state names and the tigris geography files. For instance, the CDC data includes “District of Columbia”, which I had to handle accordingly to ensure the left_join() didn’t result in missing data on the final plot. I used the shift_geometry() function to include Alaska and Hawaii in a compact view as well.

library(tidycensus)  # get_acs(); requires a Census API key

state_pop <- get_acs(
  geography = "state",
  variables = "B01003_001", 
  year = 2022
) |>
  clean_names() |>
  select(name, estimate) |>
  rename(state = name, population = estimate)

geographical_data <- cancer_state_raw |> 
  clean_names() |>
  mutate(state = str_remove(states, " \\(\\d+\\)$")) |>
  filter(!is.na(state), state != "Total") |>
  inner_join(state_pop, by = "state") |>
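  # Dividing by 24 treats count as a 24-year total, giving an average annual rate per 100,000 residents
  mutate(annual_rate_100k = (count / 24 / population) * 100000)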
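To make the rate calculation concrete, here is the same arithmetic wrapped in a small helper and applied to one made-up state. The function name annual_rate_per_100k() and the input numbers are hypothetical and only there for illustration.

```r
# Hypothetical helper mirroring the arithmetic in the mutate() step
annual_rate_per_100k <- function(count, years, population) {
  (count / years / population) * 100000
}

# Made-up state: 240,000 cases over 24 years, 2,000,000 residents
toy_rate <- annual_rate_per_100k(count = 240000, years = 24, population = 2000000)
toy_rate
# 240000 / 24 = 10000 cases per year; 10000 / 2000000 * 100000 = 500
```

Note that the population is a single-year estimate, so this is an approximation: it assumes the population was roughly constant across the whole period.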

library(tigris)  # states(), shift_geometry()

us_geo <- states(cb = TRUE, resolution = "20m") |>
  shift_geometry() |>
  clean_names()

map_ready <- us_geo |>
  left_join(geographical_data, by = c("name" = "state"))
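Name mismatches between the two sources are easy to miss, because left_join() quietly fills unmatched rows with NA instead of raising an error. One way to list them before plotting is anti_join(). The sketch below uses two tiny made-up tables as stand-ins for the geography and cancer data.

```r
library(dplyr)

# Toy stand-ins for the geography table and the cancer table
toy_geo <- tibble(name = c("Maine", "District of Columbia", "Puerto Rico"))
toy_cancer <- tibble(
  state = c("Maine", "District of Columbia"),
  annual_rate_100k = c(500, 450)
)

# Rows of the geography table with no partner in the cancer table
unmatched <- anti_join(toy_geo, toy_cancer, by = c("name" = "state"))
unmatched
```

Any row that shows up here would be drawn with a missing fill on the map, so it is worth checking whether it is a spelling difference or a territory that is genuinely absent from the cancer data.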

ggplot(map_ready) +
  geom_sf(aes(fill = annual_rate_100k), color = "white", linewidth = 0.2) +  # linewidth sets border width in ggplot2 3.4.0 and later
  scale_fill_viridis_c(option = "plasma", name = "Rate per 100k") +
  theme_void() +
  labs(
    title = "The Geography of Cancer: Incidence Rates by State (2022)",
    caption = "Source: CDC WONDER & US Census API"
  )

As you can see in the map, cancer incidence isn’t uniform across states: some states evidently have higher rates. Something I realized as I was looking at the map was that I, in fact, did not know where each US state was. My initial approach didn’t exactly work: I know that one of the bright yellow states is Maine, but I am not sure what the other yellow state is. So I decided to use an interactive map, so that when I click on a state, I know exactly which one it is.

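library(sf)       # st_transform()
library(leaflet)  # interactive maps

# Leaflet expects longitude/latitude coordinates (EPSG:4326)
map_ready_leaflet <- map_ready |>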
  st_transform(4326)

cancer_pal <- colorNumeric(
  palette = "plasma", 
  domain = map_ready_leaflet$annual_rate_100k
)

leaflet(map_ready_leaflet) |>
  addTiles() |>
  addPolygons(
    fillColor = ~cancer_pal(annual_rate_100k),
    color = "white",
    weight = 0.5,
    fillOpacity = 0.7,
    popup = ~name 
  ) |>
  addLegend(
    pal = cancer_pal, 
    values = ~annual_rate_100k, 
    title = "Annual Rate per 100k"
  )
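In the chunk above, popup shows the state name when a polygon is clicked. For a tooltip that appears on hover instead, addPolygons() also accepts a label argument. Here is a minimal sketch on a single made-up square polygon (not real state geometry); the object names are hypothetical.

```r
library(sf)
library(leaflet)

# One made-up square standing in for a state polygon (longitude/latitude)
square <- st_polygon(list(rbind(
  c(-100, 40), c(-99, 40), c(-99, 41), c(-100, 41), c(-100, 40)
)))
toy_state <- st_sf(
  name = "Toy State",
  annual_rate_100k = 450,
  geometry = st_sfc(square, crs = 4326)
)

toy_map <- leaflet(toy_state) |>
  addTiles() |>
  addPolygons(
    label = ~name,                                 # shown on hover
    popup = ~paste0(name, ": ", annual_rate_100k)  # shown on click
  )
toy_map
```

Using both arguments together gives the name on hover and the rate on click, which saves a trip to the legend.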

Much better! Now I can clearly see that the states with the highest annual rate per 100k are West Virginia and Maine.

Socioeconomics

We’ve established that some regions have higher cancer incidence than others. As I mentioned, rural hospitals have an evident lack of funding. Rural regions struggle with healthcare, job security, and other factors. Let’s pull some economic data from the Census Bureau and see what kind of patterns we can extract from it. I want to explore whether, and how, socioeconomic disparities exist alongside health disparities.

state_income <- get_acs(
  geography = "state",
  variables = "B19013_001",
  year = 2022
) |>
  clean_names() |>
  select(name, estimate) |>
  rename(state = name, median_income = estimate)

economics_data <- geographical_data |>
  inner_join(state_income, by = "state")

top_5_rich <- economics_data |>
  slice_max(median_income, n = 5) |>
  select(state, median_income, annual_rate_100k)

bottom_5_poor <- economics_data |>
  slice_min(median_income, n = 5) |>
  select(state, median_income, annual_rate_100k)

bind_rows(top_5_rich, bottom_5_poor)

This table displays the 5 richest and 5 poorest states along with their cancer rates. I don’t see any evident pattern: the District of Columbia, the richest entry in the table, has a higher cancer rate than New Mexico, one of the poorest. New Jersey has a higher rate than Mississippi, Arkansas, Louisiana, and New Mexico. To better establish whether wealth plays a role in cancer incidence, let’s plot median income against the annual rate so we can visualize the relationship between the two.

ggplot(economics_data, aes(x = median_income, y = annual_rate_100k)) +
  geom_point(color = "pink", size = 3, alpha = 0.7) +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(
    title = "The Impact of Income on Cancer Incidence",
    x = "Median Household Income",
    y = "Average Annual Cancer Rate (per 100k)",
    caption = "Source: CDC WONDER & US Census Bureau (2022)"
  )

The interactive map in the previous section revealed that certain states have higher rates of cancer. When I brought in the Census Bureau income data in this section, the resulting scatter plot revealed a complex relationship. I had expected wealth to correlate directly with lower cancer rates, but the data shows quite a bit of variation.
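A scatter plot can also be summarised with a single number, the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 indicating little linear association. The sketch below uses R’s built-in mtcars data rather than the state data, purely to show the call.

```r
# Built-in example data: car weight (wt) versus fuel economy (mpg)
r_value <- cor(mtcars$wt, mtcars$mpg)
r_value  # negative: heavier cars tend to have lower mpg

# cor.test() adds a confidence interval and a p-value for the same coefficient
r_test <- cor.test(mtcars$wt, mtcars$mpg)
r_test$conf.int
```

For this project, the equivalent call would be cor(economics_data$median_income, economics_data$annual_rate_100k).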

A More Detailed Analysis

Perhaps we can go more in-depth. I’m going to group the states into income brackets so we can compare them and graph the data in a different way (something other than the scatter plot we used previously).

bracket_data <- economics_data |>
  mutate(income_bracket = case_when(
    median_income < 65000 ~ "Low Income",
    median_income >= 65000 & median_income < 85000 ~ "Medium Income",
    median_income >= 85000 ~ "High Income"
  ))
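An alternative to case_when() for this kind of binning is base R’s cut(), which returns a factor whose levels are already in low-to-high order, so the brackets would appear in that order on a plot axis instead of alphabetically. This is a sketch on four made-up incomes, using the same 65,000 and 85,000 cut points as above.

```r
# Made-up incomes chosen to sit on and around the bracket boundaries
toy_income <- c(50000, 65000, 84999, 85000)

# right = FALSE makes each interval closed on the left, matching the >= rules above
toy_bracket <- cut(
  toy_income,
  breaks = c(-Inf, 65000, 85000, Inf),
  labels = c("Low Income", "Medium Income", "High Income"),
  right = FALSE
)
toy_bracket
levels(toy_bracket)
```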

comparison_long <- bracket_data |>
  select(state, income_bracket, annual_rate_100k) |>
  pivot_longer(
    cols = annual_rate_100k,
    names_to = "rate_type",
    values_to = "incidence_value"
  )

ggplot(comparison_long, aes(x = income_bracket, y = incidence_value, fill = income_bracket)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c(
    "High Income" = "maroon",
    "Medium Income" = "red",
    "Low Income" = "lightpink"
  )) +
  theme_minimal() +
  labs(
    title = "Cancer Incidence Variance by State Income Bracket",
    x = "Income Bracket",
    y = "Annual Rate per 100k"
  )

Well, we see that the median line for the “High Income” bracket is actually the lowest of the three, yet it displays the largest interquartile range, indicating a wide spread in cancer rates among wealthy states. The “Low Income” bracket has a much tighter distribution but contains a significant outlier, probably West Virginia, suggesting that while most lower-income states hover around a similar rate, some regional factors may cause extreme results. The “Medium Income” bracket actually has the highest median incidence rate of the three brackets. Perhaps the cancer incidence rate isn’t as greatly impacted by wealth as I had initially suspected.
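The medians and interquartile ranges read off the box plot can also be computed directly, which avoids eyeballing. This is a minimal sketch with made-up rates for two brackets (not the real state values).

```r
library(dplyr)

# Made-up rates: a tight low-income group and a spread-out high-income group
toy_brackets <- tibble(
  income_bracket = rep(c("Low Income", "High Income"), each = 4),
  annual_rate_100k = c(480, 490, 500, 510, 400, 450, 500, 550)
)

bracket_summary <- toy_brackets |>
  group_by(income_bracket) |>
  summarise(
    median_rate = median(annual_rate_100k),
    iqr_rate = IQR(annual_rate_100k)  # Q3 minus Q1, the height of the box
  )
bracket_summary
```

Running the same group_by() and summarise() on bracket_data would give the exact numbers behind each box.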

Using Linear Regression to Continue Observing The Relationship

income_model <- lm(annual_rate_100k ~ median_income, data = economics_data)

ggplot(economics_data, aes(x = median_income, y = annual_rate_100k)) +
  geom_point(color = "pink", size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", color = "maroon", se = TRUE) + 
  theme_minimal() +
  labs(  
    title = "Linear Fit with Confidence Band: Income vs. Cancer",
    x = "Median Household Income",
    y = "Average Annual Cancer Rate (per 100k)"
  )

This regression plot supports my doubt: the cancer incidence rate isn’t as greatly impacted by wealth as I had initially suspected. The wide scatter of points around the line and the width of the confidence band around it suggest that median household income is not a strong predictor of cancer incidence. To check this properly, let’s look at the p-value for the slope. If it is less than 0.05, the relationship is statistically significant at the conventional 5% level; if it is greater than 0.05, it is not. Based on the regression plot, the scatter plot, the box plot, and the table of the 5 richest and poorest states, I predict that the p-value will be higher than 0.05.

library(broom)  # tidy() turns the model summary into a data frame

model_results <- tidy(income_model)
income_p_value <- model_results |> 
  filter(term == "median_income") |> 
  pull(p.value)

income_p_value

[1] 0.2164253

Aha! The p-value of 0.2164253 is well above 0.05, which means the data shows no statistically significant linear relationship between a state’s median income and its cancer incidence rate.
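The same slope p-value can be pulled out with base R alone, and R-squared is a useful companion because it says how much of the variation the line explains. The sketch below uses the built-in mtcars data instead of the state data, so the numbers are unrelated to this project.

```r
# Built-in example data: does car weight (wt) predict fuel economy (mpg)?
fit <- lm(mpg ~ wt, data = mtcars)

# The coefficient table has one row per term; "Pr(>|t|)" is the p-value column
coef_table <- summary(fit)$coefficients
p_wt <- coef_table["wt", "Pr(>|t|)"]
p_wt < 0.05  # TRUE here: weight is a significant predictor of mpg

# Share of the variance in mpg explained by the fitted line
r_squared <- summary(fit)$r.squared
r_squared
```

For income_model, the matching lookups would be summary(income_model)$coefficients["median_income", "Pr(>|t|)"] and summary(income_model)$r.squared.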

Conclusion

My initial suspicion of a relationship between income and cancer incidence was not supported: at the state level, this data shows no statistically significant link between the two!

State Level Cancer Incidence and Socioeconomic Correlation Study

Ananya Almeida