Introduction

The burden of cancer is not distributed equally across the United States. Mortality rates are often influenced by a complex interplay of geography, population density, and racial disparities. This project analyzes a dataset from the CDC and National Cancer Institute to explore how cancer mortality rates among specific racial groups, namely Black and White populations, predict the overall state mortality rate. By categorizing states by population size, we also investigate whether larger or smaller states experience different distributions of these mortality rates.

Data Cleaning and Preparation

We load the raw data, clean the variable names for R compatibility, and create a new categorical variable for state size based on population quantiles.

# Setting global options for the document
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(ggthemes)
# 1. Load the dataset (Make sure cancer.csv is in your project folder)
cancer_data <- read.csv("cancer.csv", stringsAsFactors = FALSE)

# 2. Standardize column names (Replaces spaces with dots)
colnames(cancer_data) <- make.names(colnames(cancer_data))

# 3. Create Population Size Categories
# We split the states into three groups at the 33rd and 66th percentiles (Small, Medium, Large)
cancer_data <- cancer_data %>%
  mutate(Population_Size = case_when(
    Total.Population < quantile(Total.Population, 0.33, na.rm = TRUE) ~ "Small State",
    Total.Population < quantile(Total.Population, 0.66, na.rm = TRUE) ~ "Medium State",
    TRUE ~ "Large State"
  )) %>%
  mutate(Population_Size = factor(Population_Size, 
                                  levels = c("Small State", "Medium State", "Large State")))

# 4. Filter for valid rates
# We remove rows with 0 values to prevent skewing the regression model
cancer_data_clean <- cancer_data %>%
  filter(Total.Rate > 0, Rates.Race.White > 0, Rates.Race.Black > 0)

Multiple Linear Regression Analysis

cancer_model <- lm(Total.Rate ~ Rates.Race.White + Rates.Race.Black, data = cancer_data_clean)

summary(cancer_model)
## 
## Call:
## lm(formula = Total.Rate ~ Rates.Race.White + Rates.Race.Black, 
##     data = cancer_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.185  -9.374   0.919   7.903  54.491 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -53.48165   32.34267  -1.654    0.105    
## Rates.Race.White   1.43871    0.20426   7.043 7.02e-09 ***
## Rates.Race.Black  -0.01211    0.07393  -0.164    0.871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.78 on 47 degrees of freedom
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5309 
## F-statistic: 28.73 on 2 and 47 DF,  p-value: 7.057e-09
# Shrinking margins (mar) to prevent the "figure margins too large" error
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1)) 
plot(cancer_model)

# Visualization 1: Scatterplot
ggplot(cancer_data_clean, aes(x = Rates.Race.Black, y = Total.Rate, color = Population_Size)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("Small State" = "#E69F00", "Medium State" = "#56B4E9", "Large State" = "#009E73")) +
  labs(title = "Black Mortality vs Total State Rate", x = "Black Mortality", y = "Total Rate", caption = "Source: CDC via CORGIS") +
  theme_minimal()

# Visualization 2: Boxplot
ggplot(cancer_data_clean, aes(x = Population_Size, y = Total.Rate, fill = Population_Size)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Small State" = "#E69F00", "Medium State" = "#56B4E9", "Large State" = "#009E73")) +
  labs(title = "Distribution by State Size", x = "Size", y = "Total Rate") +
  theme_clean()

ggplot(cancer_data_clean, aes(x = Rates.Race.Black, y = Total.Rate, color = Population_Size)) +
  geom_point(size = 3, alpha = 0.7) +
  # Add a single linear regression line fitted across all states
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("Small State" = "#E69F00", 
                                "Medium State" = "#56B4E9", 
                                "Large State" = "#009E73")) +
  labs(
    title = "Impact of Black Mortality on Total State Cancer Rates",
    subtitle = "Scatterplot with Linear Regression Line",
    x = "Black Cancer Mortality Rate (per 100k)",
    y = "Total Cancer Mortality Rate (per 100k)",
    color = "State Size",
    caption = "Source: CDC/NCI via CORGIS Dataset Project"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold", size = 14))

The cleaning process involved importing the raw CSV and using the make.names() function to convert all column headers to syntactically valid R names (replacing spaces with dots). I then performed feature engineering by creating the Population_Size variable. Using the quantile() function, I binned Total.Population at the 33rd and 66th percentiles into three roughly equal groups: Small, Medium, and Large states. Finally, I used the filter() function to remove any states with incomplete data (zero or missing rates), so that the regression model and visualizations were based only on valid reported rates.
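The three cleaning steps described above can be checked on a small example. The sketch below uses a toy data frame whose values are invented for illustration (they are not from the CDC data); only the column names and cut points mirror the report.

```r
library(dplyr)

# Hypothetical toy data: values are made up, only the headers mirror the report
raw <- data.frame(
  State              = c("A", "B", "C", "D", "E", "F"),
  `Total Population` = c(1, 2, 3, 4, 5, 6) * 1e6,
  `Total Rate`       = c(180, 0, 175, NA, 160, 170),
  check.names = FALSE
)

# make.names() turns headers containing spaces into syntactically valid R names
names(raw) <- make.names(names(raw))
names(raw)

# Cut points at the 33rd and 66th percentiles, as in the report
cuts <- quantile(raw$Total.Population, c(0.33, 0.66), na.rm = TRUE)

binned <- raw %>%
  mutate(Population_Size = case_when(
    Total.Population < cuts[[1]] ~ "Small State",
    Total.Population < cuts[[2]] ~ "Medium State",
    TRUE ~ "Large State"
  ))

# Count states per bin to confirm the groups are roughly equal in size
table(binned$Population_Size)

# filter() keeps rows where the condition is TRUE, so zeros and NAs are both dropped
clean <- binned %>% filter(Total.Rate > 0)
nrow(clean)
```

Running table() on the real Population_Size column is a quick way to confirm that the 0.33 and 0.66 cut points produce the group sizes you expect.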

Visualization Interpretation

The scatterplot suggests a positive association: as Black mortality rates increase, the total mortality rate for the state also tends to rise. In the multiple regression, however, the coefficient for Rates.Race.Black is not statistically significant once the White rate is included (estimate -0.012, p = 0.871), while the coefficient for Rates.Race.White is (estimate 1.44, p < 0.001). The regression plot shows that “Small States” (orange) display a much wider visual spread, with points scattered further from the line. In contrast, “Large States” (green) are more tightly clustered. The boxplot confirms this: while the median rates are fairly similar across the three groups, small states exhibit more extreme outliers and a broader range of mortality outcomes.
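The visual impressions above (spread by state size, and the association between the two rates) can also be checked numerically. The following is a minimal sketch on invented values, not the report's data; with the real data, `toy` would be replaced by cancer_data_clean.

```r
library(dplyr)

# Hypothetical values, invented for illustration; not taken from the CDC data
toy <- data.frame(
  Population_Size  = rep(c("Small State", "Large State"), each = 4),
  Rates.Race.Black = c(150, 210, 180, 260, 195, 200, 205, 210),
  Total.Rate       = c(140, 200, 165, 230, 178, 182, 185, 190)
)

# Summarise the centre and spread of the total rate within each size group
spread <- toy %>%
  group_by(Population_Size) %>%
  summarise(
    median_rate = median(Total.Rate),
    iqr_rate    = IQR(Total.Rate),
    range_rate  = max(Total.Rate) - min(Total.Rate)
  )
spread

# Pearson correlation between the Black rate and the total rate, with a p-value
cor_result <- cor.test(toy$Rates.Race.Black, toy$Total.Rate)
cor_result$estimate
cor_result$p.value
```

Comparing iqr_rate and range_rate across groups gives a number to go with the "wider spread" seen in the plots, and cor.test() reports whether the bivariate association is statistically distinguishable from zero.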

Conclusion

One major technical challenge was resolving inconsistent variable names and ensuring all ggplot layers were correctly linked with the + operator. I ran into many errors along the way, including when trying to publish the report to RPubs. If I had more time, I would have liked to include a map of the US to visualize these mortality rates geographically.