According to the World Health Organization, lung cancer remains one of the leading causes of cancer-related mortality worldwide, with significant public health implications. Despite advances in medical research and treatment, lung cancer incidence rates continue to vary widely across geographic regions and demographic groups (Siegel, Miller, Fuchs, & Jemal, 2021). Using data from the National Cancer Institute (NCI) and the Centers for Disease Control and Prevention (CDC), I analyzed lung cancer incidence rates across US states and between genders. By identifying the states with the highest and lowest incidence rates and comparing the rates for men and women, this article aims to highlight the areas that require focused public health efforts. My initial hypothesis was that states with more industrial activity and higher levels of environmental pollutants would have higher lung cancer incidence rates.
R Programming Language: Used for the data analysis and visualization in this study.
The R packages used in this analysis were ggplot2 (for creating visualizations), sf (for handling spatial data), dplyr (for data manipulation), reshape2 (for reshaping data), maps and mapdata (for obtaining US state map data), and knitr and kableExtra (for formatting tables).
Data Sources:
Data on lung cancer incidence rates across different states in the
United States was extracted from the National Cancer Institute
(NCI), which included comprehensive statistics on incidence rates
by state and gender. Supplemental data was acquired from the Centers
for Disease Control and Prevention (CDC) to validate and provide
additional context for lung cancer incidence rates. All sources were
open data sources.
Data Variables:
State Abbreviation (STATE_ABBR): This variable includes
two-letter abbreviations representing each state in the US.
Lung Cancer Incidence Rates: The primary variables include AllAge_B_AA_Rate (incidence rates for all ages and both genders combined), AllAge_M_AA_Rate (incidence rates for males), and AllAge_F_AA_Rate (incidence rates for females).
Data Cleaning:
Ensured consistency in the formatting and naming of variables across the
dataset. Identified and appropriately handled any missing or incomplete
data entries to maintain data integrity.
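The cleaning steps described above could look roughly like the following. This is a minimal sketch rather than the code used for the analysis: the toy data frame, the helper name `clean_rates`, and the choice to drop incomplete rows are all assumptions made for illustration. Only the column names come from the dataset.

```r
# Toy stand-in for the lung cancer dataset. The two rates are copied from
# Table 1; the untidy abbreviation and the NA are inserted on purpose.
toy <- data.frame(
  STATE_ABBR       = c("KY", "wv ", "UT"),
  AllAge_B_AA_Rate = c(92.24, 79.34, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: standardize abbreviations and report missing values.
clean_rates <- function(df) {
  # Trim stray whitespace and force upper case so "wv " and "WV" agree
  df$STATE_ABBR <- toupper(trimws(df$STATE_ABBR))
  # Count missing values per column before deciding how to handle them
  missing_counts <- colSums(is.na(df))
  # Drop rows with any missing value (one possible policy, assumed here)
  complete <- df[complete.cases(df), , drop = FALSE]
  list(data = complete, missing = missing_counts)
}

result <- clean_rates(toy)
result$missing
result$data
```

Reporting the missing counts separately from the cleaned data keeps the handling decision visible instead of silently discarding rows.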
Data Loading and Transformation:
The dataset was loaded into R for analysis. State abbreviations were
matched to full state names, which were converted to lowercase to ensure
consistency and compatibility with the map data.
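The loading and name-matching step might be sketched as below. The inline CSV text stands in for the real file, whose name is not given here, and the `_demo` and `abbr_to_region` names are hypothetical. The sketch also assumes that the District of Columbia should be matched by hand, because R's built-in `state.abb` and `state.name` vectors cover only the 50 states.

```r
# Inline CSV text as a stand-in for the real data file (rates from Table 1)
csv_text <- "STATE_ABBR,AllAge_B_AA_Rate
KY,92.24
DC,49.63
UT,26.86"
lungcancer_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Lookup from two-letter abbreviation to lowercase state name.
# DC is appended manually (an assumption of this sketch).
abbr_to_region <- c(
  setNames(tolower(state.name), state.abb),
  DC = "district of columbia"
)

# Add the lowercase region name used for joining with map data
lungcancer_demo$region <- unname(abbr_to_region[lungcancer_demo$STATE_ABBR])
lungcancer_demo
```

A named lookup vector makes any abbreviation without a match show up as `NA` in `region`, which is easy to check before merging with the map.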
Descriptive Statistics:
Summary statistics, including mean, median, minimum, and maximum values,
were generated for the incidence rates of lung cancer across the
different states. The average incidence rates were calculated for males
and females as well to provide a clear comparison between genders.
Comparative Analysis:
An Analysis of Variance (ANOVA) was conducted to test for
significant differences in the mean incidence rates of lung cancer among
the different states. This test aimed to evaluate whether the state of
residence has a significant impact on lung cancer incidence rates. A
Welch Two Sample t-test was performed to determine the
significance of differences in lung cancer incidence rates between males
and females. This test aimed to evaluate whether gender has a
significant impact on lung cancer incidence rates.
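To make the Welch test concrete, the sketch below computes the statistic and its degrees of freedom by hand and compares them with `t.test()`. It uses only the male and female rates of the five highest-incidence states from Table 1, so the numbers are illustrative and are not the study's result.

```r
# Male and female rates for the five highest-incidence states in Table 1
male   <- c(111.25, 95.18, 97.80, 99.46, 93.08)
female <- c(77.84, 66.99, 62.61, 57.39, 61.53)

# Squared standard error of each group mean (variances are not pooled)
se2_m <- var(male) / length(male)
se2_f <- var(female) / length(female)

# Welch's t statistic: difference in means over the unpooled standard error
t_manual <- (mean(male) - mean(female)) / sqrt(se2_m + se2_f)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se2_m + se2_f)^2 /
  (se2_m^2 / (length(male) - 1) + se2_f^2 / (length(female) - 1))

# t.test() defaults to var.equal = FALSE, which is the Welch test
welch <- t.test(male, female)
c(manual_t = t_manual, builtin_t = unname(welch$statistic))
c(manual_df = df_manual, builtin_df = unname(welch$parameter))
```

The fractional degrees of freedom come from the Welch-Satterthwaite approximation, which is why the test does not report a whole number.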
To visualize the results, I generated bar graphs, a heatmap, and a choropleth map that compare the incidence rates across the different states and between genders. These visualizations show clearly which states have the highest and lowest rates and how the rates differ between males and females.
library(tidyverse)
library(dplyr)
library(knitr)
library(kableExtra)
# Table of incidence rates, sorted from the highest to the lowest combined rate
state_incidence_table <- lungcancer %>%
  arrange(desc(AllAge_B_AA_Rate)) %>%
  select(STATE_NAME, STATE_ABBR, AllAge_B_AA_Rate, AllAge_M_AA_Rate, AllAge_F_AA_Rate)
kable(state_incidence_table, col.names = c("STATE", "ABBR", "STATE RATE", "MALE RATE", "FEMALE RATE"), caption = "Table 1. Lung Cancer Incidence Rates by State and Gender") %>%
  kable_styling(full_width = TRUE, position = "center", font_size = 12) %>%
  row_spec(0, background = "lightgray") %>%        # header row
  row_spec(1:9, background = "lightcoral") %>%     # nine highest rates
  row_spec(10:42, background = "pink") %>%         # middle of the range
  row_spec(43:51, background = "paleturquoise")    # nine lowest rates
STATE | ABBR | STATE RATE | MALE RATE | FEMALE RATE |
---|---|---|---|---|
Kentucky | KY | 92.24 | 111.25 | 77.84 |
West Virginia | WV | 79.34 | 95.18 | 66.99 |
Arkansas | AR | 78.14 | 97.80 | 62.61 |
Mississippi | MS | 75.53 | 99.46 | 57.39 |
Tennessee | TN | 75.14 | 93.08 | 61.53 |
Maine | ME | 73.68 | 83.34 | 66.34 |
Missouri | MO | 72.89 | 85.27 | 63.50 |
Indiana | IN | 72.89 | 88.19 | 61.29 |
Rhode Island | RI | 70.57 | 77.99 | 65.73 |
Delaware | DE | 69.46 | 78.90 | 62.63 |
Oklahoma | OK | 69.16 | 83.40 | 57.97 |
North Carolina | NC | 68.76 | 84.86 | 56.69 |
Ohio | OH | 68.51 | 81.05 | 59.12 |
Louisiana | LA | 67.49 | 84.77 | 54.15 |
Alabama | AL | 66.41 | 87.28 | 50.49 |
South Carolina | SC | 65.49 | 81.76 | 52.91 |
Illinois | IL | 64.67 | 75.33 | 56.97 |
New Hampshire | NH | 64.33 | 68.26 | 62.22 |
Georgia | GA | 64.05 | 81.21 | 51.29 |
Michigan | MI | 64.03 | 73.23 | 57.17 |
Pennsylvania | PA | 64.01 | 74.84 | 56.19 |
Iowa | IA | 63.05 | 74.98 | 54.00 |
Vermont | VT | 62.36 | 68.80 | 57.23 |
Massachusetts | MA | 61.81 | 66.28 | 59.08 |
Kansas | KS | 59.85 | 69.42 | 52.73 |
Connecticut | CT | 59.81 | 65.56 | 55.82 |
Wisconsin | WI | 59.75 | 67.65 | 53.90 |
Florida | FL | 58.98 | 68.34 | 51.29 |
South Dakota | SD | 58.95 | 67.92 | 52.86 |
New York | NY | 58.87 | 66.99 | 53.24 |
Virginia | VA | 58.56 | 69.03 | 50.56 |
Nebraska | NE | 57.71 | 67.48 | 50.39 |
North Dakota | ND | 57.26 | 65.69 | 51.08 |
Maryland | MD | 56.43 | 63.75 | 51.09 |
New Jersey | NJ | 56.14 | 62.25 | 52.04 |
Alaska | AK | 56.04 | 64.15 | 48.77 |
Minnesota | MN | 55.97 | 61.81 | 51.78 |
Washington | WA | 55.83 | 61.26 | 51.72 |
Nevada | NV | 55.20 | 57.64 | 53.37 |
Montana | MT | 54.83 | 55.56 | 54.78 |
Oregon | OR | 54.68 | 59.68 | 50.93 |
Texas | TX | 51.93 | 63.40 | 43.06 |
Idaho | ID | 50.34 | 55.60 | 46.28 |
District of Columbia | DC | 49.63 | 54.47 | 46.12 |
Arizona | AZ | 48.09 | 53.05 | 44.03 |
Hawaii | HI | 45.69 | 57.16 | 36.41 |
Wyoming | WY | 44.09 | 46.16 | 42.75 |
Colorado | CO | 42.52 | 45.55 | 40.36 |
California | CA | 42.06 | 47.39 | 38.08 |
New Mexico | NM | 39.57 | 45.49 | 34.88 |
Utah | UT | 26.86 | 31.51 | 23.04 |
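Picking the highest and lowest states out of a table like this can also be done programmatically. The following is a small base R sketch on a handful of rows copied from Table 1; the object names `rates`, `ranked`, `top3`, and `bottom3` are illustrative, and the cutoff of three is arbitrary.

```r
# A handful of rows copied from Table 1 (combined rates)
rates <- data.frame(
  STATE_NAME = c("Kentucky", "West Virginia", "Maine", "Texas",
                 "California", "New Mexico", "Utah"),
  AllAge_B_AA_Rate = c(92.24, 79.34, 73.68, 51.93, 42.06, 39.57, 26.86),
  stringsAsFactors = FALSE
)

# Sort once from highest to lowest, then take rows from either end
ranked  <- rates[order(rates$AllAge_B_AA_Rate, decreasing = TRUE), ]
top3    <- head(ranked, 3)
bottom3 <- tail(ranked, 3)

top3$STATE_NAME
bottom3$STATE_NAME
```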
# Table of summary statistics
summary_table <- data.frame(
  Metric = c("Mean", "Median", "Standard Deviation"),
  Value = c(mean(lungcancer$AllAge_B_AA_Rate), median(lungcancer$AllAge_B_AA_Rate), sd(lungcancer$AllAge_B_AA_Rate))
)
kable(summary_table, col.names = c("Metric", "Value"),
      caption = "Table 2. Summary Statistics of Lung Cancer Incidence Rates") %>%
  kable_styling(full_width = TRUE, position = "center", font_size = 12) %>%
  row_spec(0, background = "lightgray") %>%
  row_spec(1, background = "khaki") %>%
  row_spec(2, background = "lightblue") %>%
  row_spec(3, background = "khaki")
Metric | Value |
---|---|
Mean | 60.58137 |
Median | 59.81000 |
Standard Deviation | 11.45977 |
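The descriptive statistics listed in the methods also include minimum and maximum values. One way to collect all of them in a single row is sketched here, using a hypothetical helper `summarise_rates`. The demo vector holds just three combined rates from Table 1, so its output is illustrative and does not reproduce Table 2.

```r
# Hypothetical helper: one row of summary statistics for a numeric vector
summarise_rates <- function(x) {
  data.frame(
    Mean   = mean(x),
    Median = median(x),
    SD     = sd(x),
    Min    = min(x),
    Max    = max(x)
  )
}

# Demo on three combined rates from Table 1 (Kentucky, Connecticut, Utah)
demo_rates <- c(92.24, 59.81, 26.86)
demo_summary <- summarise_rates(demo_rates)
demo_summary

# On the full dataset the helper could be applied to each rate column, e.g.:
# rate_cols <- c("AllAge_B_AA_Rate", "AllAge_M_AA_Rate", "AllAge_F_AA_Rate")
# do.call(rbind, lapply(lungcancer[rate_cols], summarise_rates))
```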
library(knitr)
library(kableExtra)
# T-test analysis for Gender
# Welch Two Sample t-test for the difference between male and female incidence rates
ttest_gender <- t.test(lungcancer$AllAge_M_AA_Rate, lungcancer$AllAge_F_AA_Rate)
# Extracting t-test results into a data frame
# (row.names = NULL keeps the name "t" of the test statistic from becoming a row label)
ttest_table <- data.frame(
  Statistic = format(ttest_gender$statistic, scientific = TRUE),
  P.Value = format(ttest_gender$p.value, scientific = TRUE),
  CI.Lower = format(ttest_gender$conf.int[1], scientific = TRUE),
  CI.Upper = format(ttest_gender$conf.int[2], scientific = TRUE),
  DF = ttest_gender$parameter,
  Mean.Males = ttest_gender$estimate[1],
  Mean.Females = ttest_gender$estimate[2],
  row.names = NULL
)
# Creating a table of t-test results with a caption
kable(ttest_table, col.names = c("T-Statistic", "P-Value", "CI Lower", "CI Upper", "Degrees of Freedom", "Mean (Males)", "Mean (Females)"),
      caption = "Table 3. T-Test Results for Lung Cancer Incidence Rates Between Males and Females") %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 12) %>%
  row_spec(0, background = "lightgray") %>%
  row_spec(1, background = "khaki")
T-Statistic | P-Value | CI Lower | CI Upper | Degrees of Freedom | Mean (Males) | Mean (Females) |
---|---|---|---|---|---|---|
6.764172e+00 | 1.95575e-09 | 1.214821e+01 | 2.227453e+01 | 80.55647 | 70.40137 | 53.19 |
library(ggplot2)
#All_Incidence_Rate
All_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate > 0) #Optional: equivalent to df lungcancer
All_States <- reorder(lungcancer$STATE_NAME, -lungcancer$AllAge_B_AA_Rate)
All_Incidence_Rate <- c(lungcancer$AllAge_B_AA_Rate)
# Creating a bar plot of incidence rates for all the states
ggplot(lungcancer, aes(x = All_States, y = All_Incidence_Rate)) +
  geom_bar(stat = "identity", fill = "violet", alpha = 0.7) +
  ggtitle("Lung Cancer Incidence Rates by State") +
  theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 1. Bar plot of lung cancer incidence rates for all states.
library(ggplot2)
#Highest_Incidence_Rate
Highest_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate > 70)
Highest_States <- reorder(Highest_lungcancer$STATE_NAME, -Highest_lungcancer$AllAge_B_AA_Rate)
Highest_Incidence_Rate <- c(Highest_lungcancer$AllAge_B_AA_Rate)
ggplot(Highest_lungcancer) +
  geom_bar(aes(x = Highest_States,
               y = Highest_Incidence_Rate),
           stat = "identity", fill = "red", alpha = 0.7) +
  ggtitle("Highest Lung Cancer Incidence Rates") +
  theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 2. States with the highest lung cancer incidence rates
library(ggplot2)
#Lowest_Incidence_Rate (Optional: Did not include on Sigma Xi Poster due to insufficient space)
Lowest_lungcancer <- subset(lungcancer, AllAge_B_AA_Rate < 51)
Lowest_States <- reorder(Lowest_lungcancer$STATE_NAME, -Lowest_lungcancer$AllAge_B_AA_Rate)
Lowest_Incidence_Rate <- c(Lowest_lungcancer$AllAge_B_AA_Rate)
ggplot(Lowest_lungcancer) +
  geom_bar(aes(x = Lowest_States,
               y = Lowest_Incidence_Rate),
           stat = "identity", fill = "green", alpha = 0.7) +
  ggtitle("Lowest Lung Cancer Incidence Rates") +
  theme(axis.text.x = element_text(angle = 100, vjust = 0.5, hjust = 1), plot.title = element_text(hjust = 0.5))
Figure 3. States with the lowest lung cancer incidence rates
#Gender (Male vs Female)
Avg_male_rate <- mean(lungcancer$AllAge_M_AA_Rate)
Avg_female_rate <- mean(lungcancer$AllAge_F_AA_Rate)
Gender <- c("Female", "Male")
Rate <- c(Avg_female_rate, Avg_male_rate)
barplot(Rate~Gender, main = "Lung Cancer Incidence Rate by Gender", col = "blue",
cex.main = 1.5, cex.axis = 1, cex.lab = 1.5, font.lab = 2,
font.axis = 1, cex.names = 1.5)
Figure 4. Bar plot showing the different lung cancer rates between males and females
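The averages compared here hide how much the male-female difference varies from state to state. As a rough illustration, the sketch below computes the gap for four rows copied from Table 1. The `Gap` column and the object name `gender` are introduced only for this example.

```r
# Four rows copied from Table 1 (male and female rates)
gender <- data.frame(
  STATE_NAME       = c("Kentucky", "Mississippi", "Montana", "Utah"),
  AllAge_M_AA_Rate = c(111.25, 99.46, 55.56, 31.51),
  AllAge_F_AA_Rate = c(77.84, 57.39, 54.78, 23.04),
  stringsAsFactors = FALSE
)

# Absolute difference between the male and female rate in each state
gender$Gap <- gender$AllAge_M_AA_Rate - gender$AllAge_F_AA_Rate

# List the states from the largest to the smallest gap
gender[order(-gender$Gap), c("STATE_NAME", "Gap")]
```

Among these four rows, Mississippi shows the widest gap and Montana the narrowest, with male and female rates that are nearly equal.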
library(ggplot2)
library(reshape2)
library(sf)
library(maps)
library(mapdata)
# Preparing the data for the heatmap
# Assuming 'STATE_ABBR' represents the states and 'AllAge_B_AA_Rate' represents the incidence rates
data_long <- melt(lungcancer, id.vars = "STATE_ABBR", measure.vars = "AllAge_B_AA_Rate", variable.name = "RateType", value.name = "IncidenceRate")
# Getting US map data
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
# Making geometries valid
states <- st_make_valid(states)
# Disabling spherical geometry
sf_use_s2(FALSE)
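# Preparing data: map state abbreviations to lowercase state names
lungcancer$region <- tolower(state.name[match(lungcancer$STATE_ABBR, state.abb)])
# state.abb covers only the 50 states, so the District of Columbia is set explicitly
lungcancer$region[lungcancer$STATE_ABBR == "DC"] <- "district of columbia"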
states$region <- tolower(states$ID) # Add 'region' column to map data
map_data <- merge(states, lungcancer, by = "region")
# Calculating centroids for state labels
states_centroids <- st_centroid(states)
state_labels <- data.frame(
  region = states$region,
  B = st_coordinates(states_centroids)[, 1],
  A = st_coordinates(states_centroids)[, 2],
  state_name = state.name[match(tolower(states$region), tolower(state.name))]
)
# Creating the map with state names
ggplot(map_data) +
  geom_sf(aes(fill = AllAge_B_AA_Rate), color = "white", size = 1.0) +
  geom_text(data = state_labels, aes(x = B, y = A, label = state_name), color = "black", size = 2, check_overlap = TRUE) +
  scale_fill_gradient(low = "white", high = "red", name = "Incidence Rate") +
  labs(title = "Map of Lung Cancer Incidence Rates by State",
       caption = "Developer: Evans Codjoe") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank())
Figure 5. This map provides visual insights into the distribution of lung cancer incidence across different regions, aiding public health officials and researchers in identifying areas needing targeted interventions and resources.
My findings support my initial hypotheses regarding the influence of state of residence and gender on lung cancer incidence rates. States with higher industrial activity and environmental pollutants, such as Kentucky and West Virginia, exhibited higher lung cancer incidence rates (Fig. 2). This suggests that exposure to industrial emissions and pollutants contributes significantly to lung cancer risk. The gender analysis revealed that males have higher lung cancer incidence rates than females (Fig. 4). This aligns with the understanding that males have higher smoking rates, and smoking is a major risk factor for lung cancer.
The significant variations in lung cancer incidence rates among states highlight the need for targeted public health interventions. States with higher incidence rates should implement stricter regulations on industrial emissions and promote anti-smoking campaigns. Public health policies should focus on reducing exposure to environmental pollutants and increasing awareness of lung cancer risks. Gender-specific interventions, such as tailored smoking cessation programs for men, could help address the higher incidence rates among males.
This study shows significant geographical and gender disparities in lung cancer incidence rates across the United States. The findings underscore the need for targeted public health interventions to reduce these disparities and the overall burden of lung cancer. Implementing stringent environmental regulations and promoting anti-smoking initiatives, particularly in regions with high incidence rates, is imperative. Such measures are anticipated to significantly lower lung cancer incidence rates and improve public health outcomes. By addressing both environmental and behavioral risk factors, I believe that substantial progress can be made in the prevention and early detection of lung cancer.
[1] World Health Organization, Lung Cancer (2023), https://www.who.int/news-room/fact-sheets/detail/lung-cancer
[2] National Cancer Institute, Cancer Stat Facts: Lung and Bronchus Cancer (2021), https://seer.cancer.gov/statfacts/html/lungb.html
[3] Centers for Disease Control and Prevention (2024), https://www.cdc.gov/lung-cancer/statistics/index.html
[4] Siegel RL, Miller KD, Fuchs HE, Jemal A., Cancer Statistics, 2021 (2021), https://acsjournals.onlinelibrary.wiley.com/doi/10.3322/caac.21654