Introduction

According to the World Health Organization, lung cancer remains one of the leading causes of cancer-related mortality worldwide, with significant public health implications. Despite advances in medical research and treatment, lung cancer incidence rates continue to vary widely across geographic regions and demographic groups (Siegel, Miller, Fuchs, & Jemal, 2021). Using data from the National Cancer Institute (NCI) and the Centers for Disease Control and Prevention (CDC), I analyzed lung cancer incidence rates across US states and between genders. By identifying the states with the highest and lowest incidence rates and comparing the rates for men and women, this article aims to highlight the areas that require focused public health efforts. My initial hypothesis was that states with more industrial activity and higher levels of environmental pollutants would have higher lung cancer incidence rates.

Methodology

Software and Tools

R Programming Language: Used for all data analysis and visualization in this study.

R Packages

The R packages primarily used in this analysis are ggplot2 (for creating visualizations), sf (for handling spatial data), dplyr (for data manipulation), reshape2 (for reshaping data), maps and mapdata (for obtaining US state map data), and knitr and kableExtra (for formatting tables).

Data Collection

Data Sources:
Data on lung cancer incidence rates across the United States were extracted from the National Cancer Institute (NCI) and included comprehensive statistics on incidence rates by state and gender. Supplemental data were obtained from the Centers for Disease Control and Prevention (CDC) to validate the NCI figures and provide additional context. All sources are open data.

Data Variables:
State Abbreviation (STATE_ABBR): This variable includes two-letter abbreviations representing each state in the US.

Lung Cancer Incidence Rates: The primary variables include AllAge_B_AA_Rate (incidence rates for all ages and both genders combined), AllAge_M_AA_Rate (incidence rates for males), and AllAge_F_AA_Rate (incidence rates for females).

Data Preparation

Data Cleaning:
Variable names and formats were checked for consistency across the dataset. Missing or incomplete entries were identified and handled appropriately to maintain data integrity.

Data Loading and Transformation:
The dataset was loaded into R for analysis. State abbreviations were mapped to lowercase full state names so that they match the region names used in the map data.
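The transformation step can be sketched with R's built-in `state.abb` and `state.name` vectors. This is a minimal illustration, not the exact preparation code: the data frame `toy` and the helper `abbr_to_region` are hypothetical names, and the three rates are copied from Table 1. The built-in vectors cover only the 50 states, so the sketch fills in `DC` by hand, which is an assumption about how one might treat it.

```r
# Toy subset of the data; rates copied from Table 1 (illustration only)
toy <- data.frame(
  STATE_ABBR = c("KY", "UT", "DC"),
  AllAge_B_AA_Rate = c(92.24, 26.86, 49.63),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map two-letter abbreviations to lowercase state names
abbr_to_region <- function(abbr) {
  region <- tolower(state.name[match(abbr, state.abb)])
  # state.abb has no entry for DC, so fill it in by hand (assumption)
  region[abbr == "DC"] <- "district of columbia"
  region
}

toy$region <- abbr_to_region(toy$STATE_ABBR)
print(toy)
```

Without the explicit `DC` line, the lookup would return `NA` for the District of Columbia, and that row would not join to any map region.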

Statistical Analysis

Descriptive Statistics:
Summary statistics, including mean, median, minimum, and maximum values, were generated for the incidence rates of lung cancer across the different states. The average incidence rates were calculated for males and females as well to provide a clear comparison between genders.
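As a small sketch of these descriptive statistics, the snippet below computes the mean, median, minimum, and maximum for five combined rates copied from Table 1. The helper name `describe_rates` and the choice of five states are assumptions made for illustration; the full analysis uses all rows of the dataset.

```r
# Five states' combined rates copied from Table 1 (illustration only)
rates <- c(Kentucky = 92.24, Maine = 73.68, Florida = 58.98,
           Texas = 51.93, Utah = 26.86)

# Hypothetical helper returning the statistics named above
describe_rates <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

stats <- describe_rates(rates)
print(round(stats, 2))
```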

Comparative Analysis:
An Analysis of Variance (ANOVA) was conducted to test for significant differences in the mean incidence rates of lung cancer among the different states. This test aimed to evaluate whether the state of residence has a significant impact on lung cancer incidence rates. A Welch Two Sample t-test was performed to determine the significance of differences in lung cancer incidence rates between males and females. This test aimed to evaluate whether gender has a significant impact on lung cancer incidence rates.
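The gender comparison can be sketched with base R's `t.test`, which runs the Welch version when `var.equal = FALSE` (the default). The six male and female rates below are copied from Table 1 (Kentucky, West Virginia, Arkansas, California, New Mexico, Utah) purely for illustration, so the numbers this sketch prints will differ from the full-data results.

```r
# Male and female rates for six states, copied from Table 1 (illustration only)
male   <- c(111.25, 95.18, 97.80, 47.39, 45.49, 31.51)
female <- c(77.84, 66.99, 62.61, 38.08, 34.88, 23.04)

# var.equal = FALSE (the default) gives the Welch version of the test
welch <- t.test(male, female, var.equal = FALSE)

# Pull out the pieces usually reported
result <- c(t  = unname(welch$statistic),
            df = unname(welch$parameter),
            p  = welch$p.value)
print(result)
```

Because each state contributes both a male and a female rate, a paired test (`paired = TRUE`) is another option one might consider; the analysis here reports the Welch test.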

Visualization

To present the analysis graphically, I generated bar graphs, a heatmap, and a choropleth map that compare incidence rates across states and between genders. These visualizations show which states have the highest and lowest rates and how the rates for males and females compare.

Results

library(tidyverse)   # attaches dplyr, ggplot2 and related packages
library(knitr)
library(kableExtra)

# `lungcancer` is the state-level data frame described under Data Variables
# (loading step not shown)

# Sort states from highest to lowest combined rate and keep the reporting columns
state_incidence_table <- lungcancer %>%
  arrange(desc(AllAge_B_AA_Rate)) %>%
  select(STATE_NAME, STATE_ABBR, AllAge_B_AA_Rate, AllAge_M_AA_Rate, AllAge_F_AA_Rate)

# Render Table 1, shading the top nine, middle, and bottom nine rows
kable(state_incidence_table,
      col.names = c("STATE", "ABBR", "STATE RATE", "MALE RATE", "FEMALE RATE"),
      caption = "Table 1. Lung Cancer Incidence Rates by State and Gender") %>%
  kable_styling(full_width = TRUE, position = "center", font_size = 12) %>%
  row_spec(0, background = "lightgray") %>%
  row_spec(1:9, background = "lightcoral") %>%
  row_spec(10:42, background = "pink") %>%
  row_spec(43:51, background = "paleturquoise")
Table 1. Lung Cancer Incidence Rates by State and Gender
STATE                 ABBR  STATE RATE  MALE RATE  FEMALE RATE
Kentucky              KY         92.24     111.25        77.84
West Virginia         WV         79.34      95.18        66.99
Arkansas              AR         78.14      97.80        62.61
Mississippi           MS         75.53      99.46        57.39
Tennessee             TN         75.14      93.08        61.53
Maine                 ME         73.68      83.34        66.34
Missouri              MO         72.89      85.27        63.50
Indiana               IN         72.89      88.19        61.29
Rhode Island          RI         70.57      77.99        65.73
Delaware              DE         69.46      78.90        62.63
Oklahoma              OK         69.16      83.40        57.97
North Carolina        NC         68.76      84.86        56.69
Ohio                  OH         68.51      81.05        59.12
Louisiana             LA         67.49      84.77        54.15
Alabama               AL         66.41      87.28        50.49
South Carolina        SC         65.49      81.76        52.91
Illinois              IL         64.67      75.33        56.97
New Hampshire         NH         64.33      68.26        62.22
Georgia               GA         64.05      81.21        51.29
Michigan              MI         64.03      73.23        57.17
Pennsylvania          PA         64.01      74.84        56.19
Iowa                  IA         63.05      74.98        54.00
Vermont               VT         62.36      68.80        57.23
Massachusetts         MA         61.81      66.28        59.08
Kansas                KS         59.85      69.42        52.73
Connecticut           CT         59.81      65.56        55.82
Wisconsin             WI         59.75      67.65        53.90
Florida               FL         58.98      68.34        51.29
South Dakota          SD         58.95      67.92        52.86
New York              NY         58.87      66.99        53.24
Virginia              VA         58.56      69.03        50.56
Nebraska              NE         57.71      67.48        50.39
North Dakota          ND         57.26      65.69        51.08
Maryland              MD         56.43      63.75        51.09
New Jersey            NJ         56.14      62.25        52.04
Alaska                AK         56.04      64.15        48.77
Minnesota             MN         55.97      61.81        51.78
Washington            WA         55.83      61.26        51.72
Nevada                NV         55.20      57.64        53.37
Montana               MT         54.83      55.56        54.78
Oregon                OR         54.68      59.68        50.93
Texas                 TX         51.93      63.40        43.06
Idaho                 ID         50.34      55.60        46.28
District of Columbia  DC         49.63      54.47        46.12
Arizona               AZ         48.09      53.05        44.03
Hawaii                HI         45.69      57.16        36.41
Wyoming               WY         44.09      46.16        42.75
Colorado              CO         42.52      45.55        40.36
California            CA         42.06      47.39        38.08
New Mexico            NM         39.57      45.49        34.88
Utah                  UT         26.86      31.51        23.04



# Table of summary statistics
summary_table <- data.frame(
  Metric = c("Mean", "Median", "Standard Deviation"),
  Value = c(mean(lungcancer$AllAge_B_AA_Rate), median(lungcancer$AllAge_B_AA_Rate), sd(lungcancer$AllAge_B_AA_Rate))
)
kable(summary_table, col.names = c("Metric", "Value"),
      caption = "Table 2. Summary Statistics of Lung Cancer Incidence Rates") %>%
      kable_styling(full_width = TRUE, position = "center", font_size = 12)%>%
      row_spec(0, background = "lightgray") %>%
      row_spec(1, background = "khaki") %>%
      row_spec(2, background = "lightblue") %>%
      row_spec(3, background = "khaki")
Table 2. Summary Statistics of Lung Cancer Incidence Rates
Metric              Value
Mean                60.58137
Median              59.81000
Standard Deviation  11.45977
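One way to read Table 2 alongside Table 1 is to express the most extreme states as distances from the mean in standard deviations. The sketch below does this arithmetic for Kentucky and Utah using the reported values; the variable names are hypothetical, and this standardization is an illustrative addition rather than part of the original analysis.

```r
# Summary values reported in Table 2
mean_rate <- 60.58137
sd_rate   <- 11.45977

# Highest and lowest combined rates from Table 1
extremes <- c(Kentucky = 92.24, Utah = 26.86)

# Standardized distance from the mean, in standard deviations
z_scores <- (extremes - mean_rate) / sd_rate
print(round(z_scores, 2))
```

Both states lie between two and three standard deviations from the mean, Kentucky above it and Utah below it.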



library(knitr)
library(kableExtra)

#T-test analysis for Gender
ttest_gender <- t.test(lungcancer$AllAge_M_AA_Rate, lungcancer$AllAge_F_AA_Rate) #runs a Welch Two Sample t-test for significance between male and female incidence rates

# Extracting t-test results into a data frame
ttest_table <- data.frame(
  Statistic = format(ttest_gender$statistic, scientific = TRUE),
  P.Value = format(ttest_gender$p.value, scientific = TRUE),
  CI.Lower = format(ttest_gender$conf.int[1], scientific = TRUE),
  CI.Upper = format(ttest_gender$conf.int[2], scientific = TRUE),
  DF = ttest_gender$parameter,
  Mean.Males = ttest_gender$estimate[1],
  Mean.Females = ttest_gender$estimate[2]
)

# Creating a table of t-test results with a caption
kable(ttest_table, col.names = c("T-Statistic", "P-Value", "CI Lower", "CI Upper", "Degrees of Freedom", "Mean (Males)", "Mean (Females)"), 
      caption = "Table 3. T-Test Results for Lung Cancer Incidence Rates Between Males and Females") %>%
      kable_styling(full_width = FALSE, position = "center", font_size = 12) %>%
      row_spec(0, background = "lightgray") %>%
      row_spec(1, background = "khaki")  
Table 3. T-Test Results for Lung Cancer Incidence Rates Between Males and Females
   T-Statistic   P-Value      CI Lower      CI Upper      Degrees of Freedom  Mean (Males)  Mean (Females)
t  6.764172e+00  1.95575e-09  1.214821e+01  2.227453e+01  80.55647            70.40137      53.19
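The values in Table 3 can be cross-checked with a little arithmetic: the difference between the two means should sit at the midpoint of the confidence interval, and dividing that difference by the t-statistic gives the implied standard error. The sketch below uses only the reported numbers; the variable names are hypothetical.

```r
# Values reported in Table 3
t_stat       <- 6.764172
ci_lower     <- 12.14821
ci_upper     <- 22.27453
mean_males   <- 70.40137
mean_females <- 53.19

# Difference in means; it should sit at the centre of the confidence interval
diff_means  <- mean_males - mean_females
ci_midpoint <- (ci_lower + ci_upper) / 2

# Standard error implied by t = difference / standard error
std_error <- diff_means / t_stat

print(round(c(diff = diff_means, midpoint = ci_midpoint, se = std_error), 3))
```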



library(ggplot2)

#All_Incidence_Rate
All_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate > 0) #Optional: equivalent to df lungcancer
All_States <- reorder(lungcancer$STATE_NAME, -lungcancer$AllAge_B_AA_Rate)
All_Incidence_Rate <- c(lungcancer$AllAge_B_AA_Rate)


# Creating a bar plot of incidence rates for all the states
ggplot(lungcancer, aes(x = All_States, y = All_Incidence_Rate)) +
  geom_bar(stat = "identity", fill="violet", alpha=0.7) +
  ggtitle("Lung Cancer Incidence Rates by State") +
  theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 1. Bar plot of lung cancer incidence rates for all states.




library(ggplot2)

#Highest_Incidence_Rate
Highest_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate > 70)
Highest_States <- reorder(Highest_lungcancer$STATE_NAME, -Highest_lungcancer$AllAge_B_AA_Rate)
Highest_Incidence_Rate <- c(Highest_lungcancer$AllAge_B_AA_Rate)

ggplot(Highest_lungcancer) +
  geom_bar(aes(x = Highest_States, 
               y = Highest_Incidence_Rate), 
           stat ="identity", fill = "red", alpha = 0.7)+ 
           ggtitle("Highest Lung Cancer Incidence Rates") +
           theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 2. States with the highest lung cancer incidence rates




library(ggplot2)

#Lowest_Incidence_Rate (Optional: Did not include on Sigma Xi Poster due to insufficient space)
Lowest_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate < 51)
Lowest_States <- reorder(Lowest_lungcancer$STATE_NAME, -Lowest_lungcancer$AllAge_B_AA_Rate)
Lowest_Incidence_Rate <- c(Lowest_lungcancer$AllAge_B_AA_Rate)

ggplot(Lowest_lungcancer) +
  geom_bar(aes(x = Lowest_States, 
               y = Lowest_Incidence_Rate), 
           stat="identity", fill="green", alpha=0.7)+
           ggtitle("Lowest Lung Cancer Incidence Rates") +
           theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 3. States with the lowest lung cancer incidence rates
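The highest and lowest groups above are selected with fixed cutoffs (`> 70` and `< 51`). An alternative, sketched here under the assumption that a fixed number of states is wanted at each end, is to rank the states once and take rows from the top and bottom. The data frame `toy` and the value of `n` are hypothetical; the rates are copied from Table 1.

```r
# Toy subset; names and rates copied from Table 1 (illustration only)
toy <- data.frame(
  STATE_NAME = c("Utah", "Kentucky", "Texas", "West Virginia", "New Mexico", "Ohio"),
  AllAge_B_AA_Rate = c(26.86, 92.24, 51.93, 79.34, 39.57, 68.51),
  stringsAsFactors = FALSE
)

# Sort once by descending rate, then take rows from either end
ranked <- toy[order(-toy$AllAge_B_AA_Rate), ]
n <- 2  # number of states to keep at each end (hypothetical choice)

highest <- head(ranked, n)
lowest  <- tail(ranked, n)

print(highest$STATE_NAME)
print(lowest$STATE_NAME)
```

Ranking keeps the group sizes stable if the data are updated, whereas fixed cutoffs keep the threshold itself meaningful; which is preferable depends on what the figure is meant to show.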




#Gender (Male vs Female)
Avg_male_rate <- mean(lungcancer$AllAge_M_AA_Rate)
Avg_female_rate <- mean(lungcancer$AllAge_F_AA_Rate)

Gender <- c("Female", "Male")
Rate <- c(Avg_female_rate, Avg_male_rate)

barplot(Rate~Gender, main = "Lung Cancer Incidence Rate by Gender", col = "blue", 
        cex.main = 1.5, cex.axis = 1, cex.lab = 1.5, font.lab = 2, 
        font.axis = 1, cex.names = 1.5)
Figure 4. Bar plot comparing average lung cancer incidence rates for males and females




library(ggplot2)
library(reshape2)
library(sf)
library(maps)
library(mapdata)


# Preparing the data for the heatmap
# Assuming 'STATE_ABBR' represents the states and 'AllAge_B_AA_Rate' represents the incidence rates
data_long <- melt(lungcancer, id.vars = "STATE_ABBR", measure.vars = "AllAge_B_AA_Rate", variable.name = "RateType", value.name = "IncidenceRate")


# Getting US map data
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))

# Making geometries valid
states <- st_make_valid(states)

# Disabling spherical geometry
sf_use_s2(FALSE)

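# Preparing data: map state abbreviations to lowercase full state names.
# Note: state.abb covers the 50 states only, so DC becomes NA here and is
# dropped by the merge below; Alaska and Hawaii are not in the "state" map.
lungcancer$region <- tolower(state.name[match(lungcancer$STATE_ABBR, state.abb)])
states$region <- tolower(states$ID)  # Add 'region' column to map data

# Inner join: keeps only regions present in both the map and the data
map_data <- merge(states, lungcancer, by = "region")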

# Calculating centroids for state labels
states_centroids <- st_centroid(states)
state_labels <- data.frame(
  region = states$region,
   B = st_coordinates(states_centroids)[, 1],
   A = st_coordinates(states_centroids)[, 2],
  state_name = state.name[match(tolower(states$region), tolower(state.name))]
)


# Creating the map with state names
ggplot(map_data) +
  geom_sf(aes(fill = AllAge_B_AA_Rate), color = "white", size = 1.0) +
  geom_text(data = state_labels, aes(x = B, y = A, label = state_name), color = "black", size = 2, check_overlap = TRUE) +
  scale_fill_gradient(low = "white", high = "red", name = "Incidence Rate") +
  labs(title = "Map of Lung Cancer Incidence Rates by State",
       caption = "Developer: Evans Codjoe") +
  theme_minimal() +
  theme(
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank())
Figure 5. This map provides visual insights into the distribution of lung cancer incidence across different regions, aiding public health officials and researchers in identifying areas needing targeted interventions and resources.
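Before joining incidence data to map polygons, it can help to check which region names fail to match, since an inner `merge()` drops unmatched rows without warning. The sketch below uses two small hypothetical vectors in place of the real data and map; it assumes, as appears to be the case for the `state` database in `maps`, that the map covers the contiguous states plus the District of Columbia.

```r
# Regions in the data (after the abbreviation lookup) and in the map.
# Both vectors are hypothetical toy values, not the full datasets.
data_regions <- c("kentucky", "utah", "alaska", "hawaii", NA)  # NA: failed lookup, e.g. DC
map_regions  <- c("kentucky", "utah", "district of columbia")

# Rows of the data that an inner merge() would silently drop
dropped_from_data <- setdiff(data_regions, map_regions)

# Map polygons that would have no incidence rate attached
unmatched_in_map <- setdiff(map_regions, data_regions)

print(dropped_from_data)
print(unmatched_in_map)
```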




Interpretation of Results

My findings are consistent with my initial hypotheses about the influence of state of residence and gender on lung cancer incidence rates. Kentucky and West Virginia, which I expected to rank high because of their industrial activity and environmental pollutants, had the highest incidence rates (Fig. 2). This pattern is consistent with exposure to industrial pollutants contributing to lung cancer risk, although exposure itself was not among the variables analyzed. The gender analysis showed that males have higher lung cancer incidence rates than females (Fig. 4), with mean rates of 70.40 and 53.19, respectively, and a statistically significant difference (Table 3). This aligns with the understanding that males have higher smoking rates, and smoking is a major risk factor for lung cancer.

Public Health Implications

The significant variation in lung cancer incidence rates among states highlights the need for targeted public health interventions. States with higher incidence rates should consider stricter regulations on industrial emissions and stronger anti-smoking campaigns. Public health policies should focus on reducing exposure to environmental pollutants and increasing awareness of lung cancer risks. Gender-specific interventions, such as tailored smoking cessation programs for men, could help address the higher incidence rates among males.

Conclusion

This study shows significant geographic and gender disparities in lung cancer incidence rates across the United States. The findings underscore the need for targeted public health interventions to reduce these disparities and the overall burden of lung cancer. Implementing stringent environmental regulations and promoting anti-smoking initiatives, particularly in regions with high incidence rates, is imperative. Such measures are expected to lower lung cancer incidence rates and improve public health outcomes. By addressing both environmental and behavioral risk factors, I believe that substantial progress can be made in the prevention and early detection of lung cancer.

References

[1] World Health Organization, Lung Cancer (2023), https://www.who.int/news-room/fact-sheets/detail/lung-cancer

[2] National Cancer Institute, Cancer Stat Facts: Lung and Bronchus Cancer (2021), https://seer.cancer.gov/statfacts/html/lungb.html

[3] Centers for Disease Control and Prevention (2024), https://www.cdc.gov/lung-cancer/statistics/index.html

[4] Siegel RL, Miller KD, Fuchs HE, Jemal A, Cancer Statistics, 2021 (2021), https://acsjournals.onlinelibrary.wiley.com/doi/10.3322/caac.21654