This project uses the New York City Leading Causes of Death dataset, which is available on the City of New York’s official website. The following sections describe the dataset and explain why it suits our analysis.
In this section we prepare the data:
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Load data
# Read the dataset as a CSV file from the project's GitHub repository
data <- read.csv("https://raw.githubusercontent.com/Jomifum/ProjectproposalD606/main/New_York_City_Leading_Causes_of_Death_20241016.csv")
# View the first few rows of the dataset
head(data)
## Year Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Sex Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
#Research question
Is there a relationship between the leading causes of death, demographic factors such as race/ethnicity and sex, and age-adjusted death rates over the years?
#Cases
What are the cases, and how many are there?
Each case is one combination of year, leading cause of death, sex, and racial/ethnic group. The number of cases equals the number of rows in the dataset, which nrow(data) returns.
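The case count can be read directly from the loaded data frame. Below is a minimal sketch that uses a small made-up data frame, `toy_cases` (hypothetical values, not taken from the dataset), in place of `data`. It also counts distinct combinations of the identifying columns, which is one way to check that each row really is a separate case.

```r
# Toy stand-in for `data`; the values are made up for illustration only
toy_cases <- data.frame(
  Year = c(2008, 2008, 2009, 2009),
  Leading.Cause = c("Cause A", "Cause A", "Cause A", "Cause B"),
  Sex = c("F", "M", "F", "F"),
  Race.Ethnicity = c("Hispanic", "Hispanic", "Hispanic", "Hispanic"),
  stringsAsFactors = FALSE
)

# Number of cases = number of rows
n_cases <- nrow(toy_cases)

# Number of distinct year/cause/sex/group combinations; if this is smaller
# than n_cases, some combinations appear more than once
key_cols <- c("Year", "Leading.Cause", "Sex", "Race.Ethnicity")
n_unique <- nrow(unique(toy_cases[, key_cols]))

cat("cases:", n_cases, " distinct combinations:", n_unique, "\n")
```

With the real data, replacing `toy_cases` with `data` gives the actual counts.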
#Data collection
Describe the method of data collection.
The data was collected from public health records detailing causes of death across different demographic groups, including year, gender, race/ethnicity, and calculated age-adjusted death rates.
#Type of study
What type of study is this?
This is an observational study: the data come from existing records, with no intervention or random assignment.
#Data Source
The data is published by the New York City Department of Health and Mental Hygiene on NYC Open Data: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data
A copy of the data is also publicly accessible on GitHub at: https://github.com/Jomifum/ProjectproposalD606/blob/main/New_York_City_Leading_Causes_of_Death_20241016.csv
#Variables
Description:
- Year: The year in which the deaths were recorded (quantitative).
- Leading Cause: The primary cause of death (qualitative).
- Sex: Gender of the deceased (qualitative).
- Race Ethnicity: Racial/ethnic group classification (qualitative).
- Deaths: Number of deaths for the given year, cause, and demographic group (quantitative).
- Death Rate: Death rate per 100,000 population (quantitative).
- Age Adjusted Death Rate: Age-adjusted death rate per 100,000 population (quantitative).
If running a regression, the dependent variable would likely be ‘Age Adjusted Death Rate.’
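The quantitative variables do not necessarily arrive as numeric columns. In the head() output above, rates print as 8 and 16 rather than 8.0 and 16.0, which suggests the columns were read as text, most likely because missing values are stored as ".". The sketch below reproduces that behavior on a tiny inline CSV (made-up rows, not the real file) and inspects the column classes.

```r
# Tiny inline CSV with "." as a missing-value placeholder (illustrative rows only)
csv_text <- "Year,Deaths,Death.Rate
2011,83,7.9
2009,.,.
2008,1445,122.3"

toy_raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Class of each column: a single "." forces the whole column to character
col_types <- sapply(toy_raw, class)
print(col_types)
```

Running `sapply(data, class)` on the real data frame shows which columns need conversion before any numeric summary.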
#Relevant summary statistics
Provide summary statistics for each of the variables. Include visualizations related to the research question.
##Cleaning data
In this section, the data is cleaned and tidied for statistical analysis and visualization:
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(psych)
library(stringr)
# Remove any leading or trailing whitespace from column names
colnames(data) <- trimws(colnames(data))
# Convert column names to lowercase and replace spaces with underscores
colnames(data) <- tolower(colnames(data))
colnames(data) <- gsub("\\s+", "_", colnames(data))
colnames(data) <- gsub("\\.", "_", colnames(data)) # Replace dots with underscores
# Re-inspect column names to ensure they’re clean
print(colnames(data))
## [1] "year" "leading_cause"
## [3] "sex" "race_ethnicity"
## [5] "deaths" "death_rate"
## [7] "age_adjusted_death_rate"
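The placeholder handling and numeric conversion in the next step repeat the same pattern for three columns. As a sketch, the pattern can be wrapped in one helper. The name `to_numeric` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: convert text such as "1,445" or "." to numeric,
# treating "." as missing and dropping thousands separators
to_numeric <- function(x) {
  x <- trimws(as.character(x))
  x[x == "."] <- NA       # "." marks a missing value
  x <- gsub(",", "", x)   # remove thousands separators
  as.numeric(x)
}

converted <- to_numeric(c("1,445", ".", "7.9", " 83 "))
print(converted)
```

Applied to the data, this would read `data$deaths <- to_numeric(data$deaths)`, and likewise for the two rate columns.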
# Handle missing values indicated by "." in numeric columns by replacing them with NA
data$deaths[data$deaths == "."] <- NA
data$death_rate[data$death_rate == "."] <- NA
data$age_adjusted_death_rate[data$age_adjusted_death_rate == "."] <- NA
# Convert columns to numeric using the cleaned column names
data$deaths <- as.numeric(gsub(",", "", data$deaths))
data$death_rate <- as.numeric(gsub(",", "", data$death_rate))
data$age_adjusted_death_rate <- as.numeric(gsub(",", "", data$age_adjusted_death_rate))
# Standardize the sex column
data$sex <- ifelse(data$sex %in% c("F", "Female"), "Female",
                   ifelse(data$sex %in% c("M", "Male"), "Male", data$sex))
# Separate race and ethnicity, then remove the original race_ethnicity column.
# "Non-Hispanic" contains "Hispanic", so it must be excluded explicitly;
# rows labeled only "Hispanic" end up with an empty race value.
data <- data %>%
  mutate(ethnicity = ifelse(grepl("Hispanic", race_ethnicity) & !grepl("Non-Hispanic", race_ethnicity),
                            "Hispanic", "Non-Hispanic"),
         race = gsub("Non-Hispanic|Hispanic", "", race_ethnicity)) %>%
  select(-race_ethnicity)
# Clean up race column by trimming whitespace
data$race <- trimws(data$race)
# Remove leading/trailing whitespace from character column values
char_columns <- sapply(data, is.character)
data[char_columns] <- lapply(data[char_columns], trimws)
# View the cleaned data to confirm
head(data)
## year leading_cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## sex deaths death_rate age_adjusted_death_rate ethnicity
## 1 Female 83 7.9 6.9 Non-Hispanic
## 2 Female 96 8.0 8.1 Hispanic
## 3 Female 155 12.9 16.0 Hispanic
## 4 Female 1445 122.3 160.7 Hispanic
## 5 Female 14 2.5 3.6 Non-Hispanic
## 6 Female 36 6.8 8.5 Non-Hispanic
## race
## 1 Black
## 2
## 3
## 4
## 5 Asian and Pacific Islander
## 6 Asian and Pacific Islander
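When an ethnicity label is derived by substring matching, note that the string "Non-Hispanic" itself contains "Hispanic". The sketch below uses three labels that appear in the head() output and compares a plain substring match with one that excludes "Non-Hispanic". The vector names are illustrative.

```r
# Labels seen in the head() output of the raw data
labels <- c("Hispanic", "Black Non-Hispanic", "Asian and Pacific Islander")

# Plain substring match: also fires on "Non-Hispanic"
plain_match <- ifelse(grepl("Hispanic", labels), "Hispanic", "Non-Hispanic")

# Match that explicitly excludes "Non-Hispanic"
strict_match <- ifelse(
  grepl("Hispanic", labels) & !grepl("Non-Hispanic", labels),
  "Hispanic", "Non-Hispanic"
)

# Side-by-side comparison of the two classifications
check <- data.frame(label = labels, plain_match, strict_match)
print(check)
```

A cross-tabulation of the original labels against the derived column, for example with `table()`, gives the same check on the full dataset before the original column is dropped.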
# Summary statistics
summary(data$deaths)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 25.0 140.0 429.3 317.2 7050.0 138
summary(data$death_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.40 12.84 20.10 56.26 78.90 491.40 729
summary(data$age_adjusted_death_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.50 12.34 19.80 53.61 81.50 414.59 729
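The overall summaries above describe each variable on its own. Since the research question concerns differences between demographic groups, grouped summaries are a natural next step. Below is a minimal sketch with dplyr on a made-up data frame, `toy_rates` (hypothetical values), that has the same column names as the cleaned data.

```r
library(dplyr)

# Toy stand-in for the cleaned `data`; values are made up for illustration
toy_rates <- data.frame(
  sex = c("Female", "Female", "Male", "Male"),
  age_adjusted_death_rate = c(6.0, 9.0, NA, 16.0)
)

# Count of non-missing values, mean, and median per sex
rate_by_sex <- toy_rates %>%
  group_by(sex) %>%
  summarize(
    n_non_missing = sum(!is.na(age_adjusted_death_rate)),
    mean_rate = mean(age_adjusted_death_rate, na.rm = TRUE),
    median_rate = median(age_adjusted_death_rate, na.rm = TRUE),
    .groups = "drop"
  )

print(rate_by_sex)
```

The same pattern works with `race` or `leading_cause` as the grouping column.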
# Histogram of Deaths
ggplot(data, aes(x = deaths)) +
  geom_histogram(binwidth = 50, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Deaths", x = "Number of Deaths", y = "Frequency")
## Warning: Removed 138 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Death Rate by Year and Race
ggplot(data, aes(x = year, y = death_rate, color = race)) +
  geom_line() +
  labs(title = "Death Rate Over Years by Race", x = "Year", y = "Death Rate")
## Warning: Removed 729 rows containing missing values or values outside the scale range
## (`geom_line()`).
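Each year and race combination has many rows in the data (one per cause and sex), so a line drawn through the raw rows connects several values within the same year. One option, sketched below on made-up values, is to average the death rate per year and race before plotting. The names `toy_trend` and `yearly_rates` are illustrative, and using the mean as the summary is an assumption rather than the author's chosen method.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cleaned `data`; values are made up for illustration
toy_trend <- data.frame(
  year = c(2008, 2008, 2009, 2009, 2008, 2009),
  race = c("Black", "Black", "Black", "Black",
           "Asian and Pacific Islander", "Asian and Pacific Islander"),
  death_rate = c(10, 20, 12, NA, 5, 7)
)

# One mean death rate per year and race, ignoring missing values
yearly_rates <- toy_trend %>%
  group_by(year, race) %>%
  summarize(mean_death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop")

# Line plot with exactly one point per year for each race
p <- ggplot(yearly_rates, aes(x = year, y = mean_death_rate, color = race)) +
  geom_line() +
  labs(title = "Mean Death Rate Over Years by Race", x = "Year", y = "Mean Death Rate")

# Call print(p) to draw the plot
```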
# Get top 10 leading causes by average age-adjusted death rate
top_causes <- data %>%
  group_by(leading_cause) %>%
  summarize(avg_death_rate = mean(age_adjusted_death_rate, na.rm = TRUE)) %>%
  arrange(desc(avg_death_rate)) %>%
  slice_head(n = 10) %>% # Change to 20 for top 20
  pull(leading_cause)
# Filter data to include only the top 10 leading causes
filtered_data <- data %>%
  filter(leading_cause %in% top_causes)
# Plot the top 10 leading causes by sex
ggplot(filtered_data, aes(x = leading_cause, y = age_adjusted_death_rate, fill = sex)) +
  geom_boxplot() +
  theme_minimal(base_size = 14) + # Minimal theme for a clean look
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1, size = 10), # Rotate and shrink x-axis labels
    axis.text.y = element_text(size = 12), # Larger y-axis labels
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 16, face = "bold"),
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    legend.position = "top",
    plot.margin = unit(c(1, 1, 1, 1), "lines") # Adjust margins
  ) +
  labs(
    title = "Age-Adjusted Death Rate for the Top 10 Leading Causes, by Sex",
    x = "Leading Cause",
    y = "Age-Adjusted Death Rate"
  ) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + # Wrap long cause labels
  scale_fill_brewer(palette = "Set2") # Use the Set2 color palette
## Warning: Removed 353 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In the final project, a more detailed analysis will be conducted to address the research question comprehensively. This will include providing in-depth conclusions, identifying limitations, and offering informed suggestions for future research.