This project uses the New York City Leading Causes of Death dataset, which is available on the City of New York’s official website. The following sections outline the reasons for selecting this dataset for our analysis.

Data Preparation

In this section we prepare the data:

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(psych)
# Load data
# Read the CSV copy of the dataset hosted in the project's GitHub repository
data <- read.csv("https://raw.githubusercontent.com/Jomifum/ProjectproposalD606/main/New_York_City_Leading_Causes_of_Death_20241016.csv")


# View the first few rows of the dataset
head(data)
##   Year                                                           Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009                            Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008                          Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009                                               Alzheimer's Disease (G30)
## 6 2008        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
##   Sex             Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1   F         Black Non-Hispanic     83        7.9                     6.9
## 2   F                   Hispanic     96          8                     8.1
## 3   F                   Hispanic    155       12.9                      16
## 4   F                   Hispanic   1445      122.3                   160.7
## 5   F Asian and Pacific Islander     14        2.5                     3.6
## 6   F Asian and Pacific Islander     36        6.8                     8.5

#Research question

Is there a relationship between leading causes of death and demographic factors such as race, sex, and age-adjusted death rates over the years?

#Cases

What are the cases, and how many are there?

Each case represents one combination of year, leading cause of death, sex, and racial/ethnic group. The number of cases equals the number of rows in the dataset, as returned by nrow(data).
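
A minimal sketch of how the case count and the definition of a case could be checked. The data frame `toy` and its values are hypothetical stand-ins, not rows from the real file; with the actual dataset the same calls would be applied to `data`.

```r
# Hypothetical stand-in for the real dataset (values are illustrative only)
toy <- data.frame(
  Year = c(2008, 2008, 2009, 2009),
  Leading.Cause = c("Diseases of Heart", "Diseases of Heart",
                    "Alzheimer's Disease", "Alzheimer's Disease"),
  Sex = c("F", "M", "F", "M"),
  Race.Ethnicity = c("Hispanic", "Hispanic",
                     "Black Non-Hispanic", "Black Non-Hispanic"),
  stringsAsFactors = FALSE
)

# Total number of cases (rows)
n_cases <- nrow(toy)

# Number of rows that repeat an earlier year/cause/sex/group combination;
# zero means every combination appears exactly once
key_columns <- c("Year", "Leading.Cause", "Sex", "Race.Ethnicity")
n_repeated <- sum(duplicated(toy[, key_columns]))

print(n_cases)
print(n_repeated)
```

If `n_repeated` is greater than zero on the real data, the description of a case as a unique combination would need to be revisited.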

#Data collection

Describe the method of data collection.

The data was collected from public health records detailing causes of death across different demographic groups, including year, gender, race/ethnicity, and calculated age-adjusted death rates.

#Type of study

What type of study is this?

This is an observational study.

#Data Source

This data is published by the New York City Department of Health and Mental Hygiene on NYC Open Data at: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data

A copy of the dataset is also publicly accessible on GitHub at: https://github.com/Jomifum/ProjectproposalD606/blob/main/New_York_City_Leading_Causes_of_Death_20241016.csv.

#Variables Description

- Year: The year in which the deaths were recorded (quantitative).
- Leading Cause: The primary cause of death (qualitative).
- Sex: Gender of the deceased (qualitative).
- Race Ethnicity: Racial/ethnic group classification (qualitative).
- Deaths: Number of deaths for the given year, cause, and demographic group (quantitative).
- Death Rate: Death rate per 100,000 population (quantitative).
- Age Adjusted Death Rate: Age-adjusted death rate per 100,000 population (quantitative).

If running a regression, the dependent variable would likely be ‘Age Adjusted Death Rate.’
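
A minimal sketch of what such a regression could look like, assuming a linear model with year and sex as predictors. The `toy` data frame, its lowercase column names, and all of its values are hypothetical and made up for illustration only; the final analysis may use a different model and predictors.

```r
# Hypothetical toy data; the rates below are invented for illustration
toy <- data.frame(
  year = rep(2008:2011, each = 2),
  sex = rep(c("Female", "Male"), times = 4),
  age_adjusted_death_rate = c(160.7, 210.2, 155.3, 204.8,
                              150.1, 199.5, 148.0, 195.2)
)

# Age-adjusted death rate as the dependent variable,
# year (numeric) and sex (categorical) as predictors
fit <- lm(age_adjusted_death_rate ~ year + sex, data = toy)

# Estimated intercept, yearly trend, and Male-vs-Female difference
print(coef(fit))
```

R turns the character column `sex` into a factor automatically, so the model reports a single `sexMale` coefficient relative to the `Female` baseline.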

#Relevant summary statistics

Provide summary statistics for each of the variables. Include visualizations related to the research question.

##Cleaning data

In this section the data is cleaned and tidied for statistical analysis and visualization:

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(psych)
library(stringr)

# Load data
data <- read.csv("https://raw.githubusercontent.com/Jomifum/ProjectproposalD606/main/New_York_City_Leading_Causes_of_Death_20241016.csv")

# Remove any leading or trailing whitespace from column names
colnames(data) <- trimws(colnames(data))

# Convert column names to lowercase and replace spaces with underscores
colnames(data) <- tolower(colnames(data))
colnames(data) <- gsub("\\s+", "_", colnames(data))
colnames(data) <- gsub("\\.", "_", colnames(data))  # Replace dots with underscores

# Re-inspect column names to ensure they’re clean
print(colnames(data))
## [1] "year"                    "leading_cause"          
## [3] "sex"                     "race_ethnicity"         
## [5] "deaths"                  "death_rate"             
## [7] "age_adjusted_death_rate"
# Handle missing values indicated by "." in numeric columns by replacing them with NA
data$deaths[data$deaths == "."] <- NA
data$death_rate[data$death_rate == "."] <- NA
data$age_adjusted_death_rate[data$age_adjusted_death_rate == "."] <- NA

# Convert columns to numeric using the cleaned column names
data$deaths <- as.numeric(gsub(",", "", data$deaths))
data$death_rate <- as.numeric(gsub(",", "", data$death_rate))
data$age_adjusted_death_rate <- as.numeric(gsub(",", "", data$age_adjusted_death_rate))

# Standardize the sex column
data$sex <- ifelse(data$sex %in% c("F", "Female"), "Female", 
                   ifelse(data$sex %in% c("M", "Male"), "Male", data$sex))

# Separate race and ethnicity and remove the original race_ethnicity column.
# "Non-Hispanic" contains the substring "Hispanic", so a row is labelled
# Hispanic only when its label does not also contain "Non-Hispanic".
data <- data %>%
  mutate(ethnicity = ifelse(grepl("Hispanic", race_ethnicity) &
                              !grepl("Non-Hispanic", race_ethnicity),
                            "Hispanic", "Non-Hispanic"),
         race = gsub("Non-Hispanic|Hispanic", "", race_ethnicity)) %>%
  select(-race_ethnicity)

# Clean up race column by trimming whitespace
# (rows labelled only "Hispanic" are left with an empty race value)
data$race <- trimws(data$race)

# Remove leading/trailing whitespace from character column values
char_columns <- sapply(data, is.character)
data[char_columns] <- lapply(data[char_columns], trimws)

# View the cleaned data to confirm
head(data)
##   year                                                           leading_cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009                            Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008                          Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009                                               Alzheimer's Disease (G30)
## 6 2008        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
##      sex deaths death_rate age_adjusted_death_rate    ethnicity
## 1 Female     83        7.9                     6.9 Non-Hispanic
## 2 Female     96        8.0                     8.1     Hispanic
## 3 Female    155       12.9                    16.0     Hispanic
## 4 Female   1445      122.3                   160.7     Hispanic
## 5 Female     14        2.5                     3.6 Non-Hispanic
## 6 Female     36        6.8                     8.5 Non-Hispanic
##                         race
## 1                      Black
## 2                           
## 3                           
## 4                           
## 5 Asian and Pacific Islander
## 6 Asian and Pacific Islander
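
One detail worth checking when splitting a combined label such as "Black Non-Hispanic" is that the substring "Hispanic" also occurs inside "Non-Hispanic". The sketch below compares a plain substring test with a stricter one on three labels taken from the output above; the variable names `naive` and `strict` are illustrative only.

```r
# Labels as they appear in the Race.Ethnicity column of the sample rows
labels <- c("Black Non-Hispanic", "Hispanic", "Asian and Pacific Islander")

# Plain substring test: also flags "Black Non-Hispanic",
# because "Hispanic" occurs inside "Non-Hispanic"
naive <- grepl("Hispanic", labels, fixed = TRUE)

# Stricter test: contains "Hispanic" but not "Non-Hispanic"
strict <- naive & !grepl("Non-Hispanic", labels, fixed = TRUE)

print(data.frame(label = labels, naive = naive, strict = strict))
```

Only the stricter test marks "Hispanic" alone as Hispanic, which is the behaviour an ethnicity column needs.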

Statistical analysis and visualizations:

# Summary statistics
summary(data$deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    25.0   140.0   429.3   317.2  7050.0     138
summary(data$death_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.40   12.84   20.10   56.26   78.90  491.40     729
summary(data$age_adjusted_death_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.50   12.34   19.80   53.61   81.50  414.59     729
# Histogram of Deaths
ggplot(data, aes(x=deaths)) + 
  geom_histogram(binwidth=50, fill="skyblue", color="black") +
  labs(title="Distribution of Deaths", x="Number of Deaths", y="Frequency")
## Warning: Removed 138 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Death Rate by Year and Race
ggplot(data, aes(x=year, y=death_rate, color=race)) + 
  geom_line() + 
  labs(title="Death Rate Over Years by Race", x="Year", y="Death Rate")
## Warning: Removed 729 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Get top 10 leading causes by average age-adjusted death rate
top_causes <- data %>%
  group_by(leading_cause) %>%
  summarize(avg_death_rate = mean(age_adjusted_death_rate, na.rm = TRUE)) %>%
  arrange(desc(avg_death_rate)) %>%
  slice_head(n = 10) %>%  # Change to 20 for top 20
  pull(leading_cause)

# Filter data to include only the top 10 leading causes
filtered_data <- data %>%
  filter(leading_cause %in% top_causes)

# Plot the top 10 leading causes by sex
ggplot(filtered_data, aes(x=leading_cause, y=age_adjusted_death_rate, fill=sex)) + 
  geom_boxplot() + 
  theme_minimal(base_size = 14) +  # Minimal theme for clean look
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1, size = 10),  # Adjust x-axis labels
    axis.text.y = element_text(size = 12),  # Larger y-axis labels
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 16, face = "bold"),
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    legend.position = "top",
    plot.margin = unit(c(1, 1, 1, 1), "lines")  # Adjust margins
  ) +
  labs(
    title="Age-Adjusted Death Rate for the Top 10 Leading Causes by Sex",
    x="Leading Cause",
    y="Age Adjusted Death Rate"
  ) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +  # Wrap longer labels
  scale_fill_brewer(palette = "Set2")  # Use color palette
## Warning: Removed 353 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
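
Because each year contains several rows per race (one for every cause and sex), a line plot drawn on the raw rows connects all of those rows in sequence. Below is a minimal sketch of one alternative, assuming that averaging the death rate within each year and race is an acceptable summary for plotting. The `toy` data frame and its values are hypothetical, and `mean_death_rate` is an illustrative column name.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the cleaned data (values are illustrative only)
toy <- data.frame(
  year = c(2008, 2008, 2008, 2009, 2009, 2009),
  race = c("Black", "Black", "White", "Black", "Black", "White"),
  death_rate = c(10, 20, 15, 12, NA, 18)
)

# Collapse to one value per year and race: the mean death rate,
# ignoring missing values
yearly <- toy %>%
  group_by(year, race) %>%
  summarize(mean_death_rate = mean(death_rate, na.rm = TRUE), .groups = "drop")

# One line per race with one point per year
p <- ggplot(yearly, aes(x = year, y = mean_death_rate, color = race)) +
  geom_line() +
  labs(title = "Mean Death Rate Over Years by Race",
       x = "Year", y = "Mean Death Rate")
```

Note that an unweighted mean of cause-specific rates is a plotting convenience, not a population-level death rate; a weighted summary would need population counts that are not part of this dataset.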

In the final project, a more detailed analysis will be conducted to address the research question comprehensively. This will include providing in-depth conclusions, identifying limitations, and offering informed suggestions for future research.