Introduction

The burden of cancer is not distributed equally across the United States. Mortality rates are often influenced by a complex interplay of geography, population density, and racial disparities. This project analyzes a dataset from the CDC and National Cancer Institute to explore how cancer mortality rates among specific racial groups (Black and White populations) predict the overall state mortality rate. By categorizing states by population size, we also investigate whether larger or smaller states experience different distributions of these mortality rates.

Data Cleaning and Preparation

We begin by loading the raw data, cleaning the variable names for R compatibility, and creating a new categorical variable for state size based on population quantiles.

# Setting global options for the document
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(ggthemes)

# 1. Load the dataset (Make sure cancer.csv is in your project folder)
cancer_data <- read.csv("cancer.csv", stringsAsFactors = FALSE)

# 2. Standardize column names (Replaces spaces with dots)
colnames(cancer_data) <- make.names(colnames(cancer_data))

# 3. Create Population Size Categories
# We split the states into three groups at the 33rd and 66th percentiles (Small, Medium, Large)
cancer_data <- cancer_data %>%
  mutate(Population_Size = case_when(
    Total.Population < quantile(Total.Population, 0.33, na.rm = TRUE) ~ "Small State",
    Total.Population < quantile(Total.Population, 0.66, na.rm = TRUE) ~ "Medium State",
    TRUE ~ "Large State"
  )) %>%
  mutate(Population_Size = factor(Population_Size, 
                                  levels = c("Small State", "Medium State", "Large State")))

# 4. Filter for valid rates
# We remove rows with 0 values to prevent skewing the regression model
cancer_data_clean <- cancer_data %>%
  filter(Total.Rate > 0, Rates.Race.White > 0, Rates.Race.Black > 0)

Multiple Linear Regression Analysis

cancer_model <- lm(Total.Rate ~ Rates.Race.White + Rates.Race.Black, data = cancer_data_clean)

summary(cancer_model)

## 
## Call:
## lm(formula = Total.Rate ~ Rates.Race.White + Rates.Race.Black, 
##     data = cancer_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.185  -9.374   0.919   7.903  54.491 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -53.48165   32.34267  -1.654    0.105    
## Rates.Race.White   1.43871    0.20426   7.043 7.02e-09 ***
## Rates.Race.Black  -0.01211    0.07393  -0.164    0.871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.78 on 47 degrees of freedom
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5309 
## F-statistic: 28.73 on 2 and 47 DF,  p-value: 7.057e-09

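Equation: Based on the coefficients, the fitted model is:

\(\widehat{\text{Total.Rate}} = -53.48 + 1.44 \times \text{Rates.Race.White} - 0.01 \times \text{Rates.Race.Black}\)

P-Values: Only Rates.Race.White is statistically significant (p = 7.02e-09, well below \(0.05\)). Rates.Race.Black is not (p = 0.871), so once White mortality is in the model, Black mortality adds little explanatory power for the state’s total mortality rate. The overall F-test is significant (p = 7.057e-09).

Adjusted R-squared: The adjusted \(R^2\) of 0.53 indicates that about 53% of the variance in total cancer rates is explained by these two rate variables, leaving nearly half unexplained.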
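To make the fitted equation concrete, the sketch below plugs one hypothetical state into the coefficients printed in the summary output. The input rates (180 and 200 per 100k) are invented for illustration and are not taken from the dataset. The second half recomputes the adjusted R-squared from the reported R-squared, using the 50 observations implied by 47 residual degrees of freedom and 2 predictors.

```r
# Coefficients copied from the summary(cancer_model) output
b <- c(intercept = -53.48165, white = 1.43871, black = -0.01211)

# Hypothetical state (assumed values, not from cancer.csv)
white_rate <- 180
black_rate <- 200

# Predicted total rate from the fitted equation
predicted_total <- unname(
  b["intercept"] + b["white"] * white_rate + b["black"] * black_rate
)
print(round(predicted_total, 2))

# Adjusted R-squared recomputed from the reported R-squared:
# n = 47 residual df + 3 estimated coefficients = 50, p = 2 predictors
r2 <- 0.5501
n <- 50
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

With the real model object, `predict(cancer_model, newdata = data.frame(Rates.Race.White = 180, Rates.Race.Black = 200))` should give essentially the same prediction, up to rounding of the printed coefficients.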

# Shrinking margins (mar) to prevent the "figure margins too large" error
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1)) 
plot(cancer_model)

# Visualization 1: Scatterplot
ggplot(cancer_data_clean, aes(x = Rates.Race.Black, y = Total.Rate, color = Population_Size)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("Small State" = "#E69F00", "Medium State" = "#56B4E9", "Large State" = "#009E73")) +
  labs(title = "Black Mortality vs Total State Rate", x = "Black Mortality", y = "Total Rate", caption = "Source: CDC via CORGIS") +
  theme_minimal()

# Visualization 2: Boxplot
ggplot(cancer_data_clean, aes(x = Population_Size, y = Total.Rate, fill = Population_Size)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Small State" = "#E69F00", "Medium State" = "#56B4E9", "Large State" = "#009E73")) +
  labs(title = "Distribution by State Size", x = "Size", y = "Total Rate") +
  theme_clean()

ggplot(cancer_data_clean, aes(x = Rates.Race.Black, y = Total.Rate, color = Population_Size)) +
  geom_point(size = 3, alpha = 0.7) +
  # Add a single linear regression line with a 95% confidence band
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("Small State" = "#E69F00", 
                                "Medium State" = "#56B4E9", 
                                "Large State" = "#009E73")) +
  labs(
    title = "Impact of Black Mortality on Total State Cancer Rates",
    subtitle = "Scatterplot with Linear Regression Line",
    x = "Black Cancer Mortality Rate (per 100k)",
    y = "Total Cancer Mortality Rate (per 100k)",
    color = "State Size",
    caption = "Source: CDC/NCI via CORGIS Dataset Project"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold", size = 14))

The cleaning process involved importing the raw CSV and using the make.names() function to convert all column headers into syntactically valid R names (for example, replacing spaces with dots). I then performed feature engineering by creating the Population_Size variable. Using the quantile() function, I binned Total.Population into three roughly equal groups: Small, Medium, and Large states. Finally, I used the filter() function to remove any states with incomplete data (zero or missing rates), so that the regression model and visualizations were based only on valid reported rates.
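The two cleaning steps described above can be tried on a small scale. This is a minimal sketch in base R, assuming made-up column headers and nine invented state populations; it uses cut() rather than the case_when() call in the report, so values exactly on a boundary may be assigned differently.

```r
# Hypothetical raw headers, as they might look in a CSV with spaces
raw_names <- c("State", "Total Rate", "Total Population", "Rates Race White")
clean_names <- make.names(raw_names)
print(clean_names)

# Hypothetical populations for nine states (invented values)
pop <- c(0.6, 1.1, 2.9, 4.5, 6.0, 7.3, 10.1, 19.5, 38.0) * 1e6

# Tercile cut points: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(pop, probs = c(0, 1/3, 2/3, 1))

# Assign each state to a size group; include.lowest keeps the smallest state
size <- cut(pop,
            breaks = breaks,
            labels = c("Small State", "Medium State", "Large State"),
            include.lowest = TRUE)

# Check how many states fall into each group
print(table(size))
```

Checking the group counts with table() is a quick way to confirm that the bins really are balanced before they are used in a plot.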

Visualization Interpretation

The visualizations suggest a positive relationship: as Black mortality rates increase, the total mortality rate for the state also tends to rise. This pattern should be read with caution, because in the regression above the coefficient on Black mortality is not statistically significant once White mortality is included (p = 0.871). The regression plot shows that “Small States” (in orange) display a much wider spread, with points scattered further from the line. In contrast, “Large States” (green) are more tightly clustered. The boxplot confirms this: while the median rates are fairly similar across the three groups, small states exhibit more extreme outliers and a broader range of mortality outcomes.
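The difference in spread between size groups can also be checked numerically rather than by eye. The sketch below computes the median, interquartile range, and full range per group on a small invented data frame; the values in `toy` are hypothetical and only mimic the column names used in this report.

```r
# Hypothetical data with the same column names as cancer_data_clean
size_levels <- c("Small State", "Medium State", "Large State")
toy <- data.frame(
  Population_Size = factor(rep(size_levels, each = 4), levels = size_levels),
  Total.Rate = c(150, 180, 210, 240,   # small states: widely spread
                 170, 180, 190, 200,   # medium states
                 178, 182, 186, 190)   # large states: tightly clustered
)

# One column per size group, one row per summary statistic
spread <- sapply(
  split(toy$Total.Rate, toy$Population_Size),
  function(x) c(median = median(x), iqr = IQR(x), range = diff(range(x)))
)
print(spread)
```

Replacing `toy` with `cancer_data_clean` would give the same summary for the real data, which could then be reported alongside the boxplot.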

Conclusion

One major technical challenge was resolving inconsistent variable names and ensuring all ggplot layers were correctly linked with the + operator. I ran into many errors along the way, including when trying to upload the report to RPubs. If I had more time, I would have liked to include a map of the US to visualize these mortality rates geographically.

Racial and Demographic Drivers of Cancer Mortality

Sadiya Sow

March 30, 2026
