#Introduction
This project examines mortality trends using the New York City Leading Causes of Death dataset. The analysis aims to uncover demographic disparities, temporal trends, and potential public health insights by evaluating the relationship between leading causes of death and various demographic and temporal factors.
#Research Question
Is there a relationship between age-adjusted death rates and the leading causes of death, demographic factors such as race and sex, and year?
#Variables
#Dependent Variable:
Age-adjusted death rate (quantitative continuous data representing mortality rate adjusted for age distribution).
#Independent Variables:
Demographic factors:
Race/Ethnicity: Categorical data representing different racial and ethnic groups.
Sex: Categorical data indicating male or female.
Temporal factor:
Year: Categorical or continuous data representing different years to capture temporal trends.
Leading Cause:
Cause of death: Categorical qualitative data representing different leading causes of death, such as heart disease, cancer, etc.
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(psych)
## Warning: package 'psych' was built under R version 4.4.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(stringr)
# Load dataset
data <- read.csv("https://raw.githubusercontent.com/Jomifum/ProjectproposalD606/main/New_York_City_Leading_Causes_of_Death_20241016.csv")
# Inspect the dataset
head(data)
## Year Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Sex Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
The analysis begins by loading necessary libraries for data manipulation, visualization, and statistical analysis. The dplyr package is used for data cleaning and transformation, while ggplot2 is employed for creating insightful visualizations. The psych package provides additional statistical functions, and stringr is utilized for text manipulation.
Next, the dataset is loaded from a specified URL using the read.csv function. A preliminary inspection of the data is conducted using the head() function, which displays the initial rows of the dataset, providing a glimpse into its structure, column names, and content.
#Data Cleaning
# Clean column names
colnames(data) <- tolower(gsub("\\s+|\\.", "_", trimws(colnames(data))))
# Handle missing values
data <- data %>%
mutate(
deaths = as.numeric(gsub(",", "", deaths)),
death_rate = as.numeric(gsub(",", "", death_rate)),
age_adjusted_death_rate = as.numeric(gsub(",", "", age_adjusted_death_rate)),
sex = ifelse(sex %in% c("F", "Female"), "Female", ifelse(sex %in% c("M", "Male"), "Male", NA))
)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `deaths = as.numeric(gsub(",", "", deaths))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
# Separate race and ethnicity
data <- data %>%
mutate(
ethnicity = ifelse(grepl("Hispanic", race_ethnicity), "Hispanic", "Non-Hispanic"),
race = str_trim(gsub("Hispanic|Non-Hispanic", "", race_ethnicity))
) %>%
select(-race_ethnicity)
# Remove leading/trailing whitespace from character columns
char_columns <- sapply(data, is.character)
data[char_columns] <- lapply(data[char_columns], trimws)
# Verify the cleaned data
head(data)
## year leading_cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## sex deaths death_rate age_adjusted_death_rate ethnicity
## 1 Female 83 7.9 6.9 Hispanic
## 2 Female 96 8.0 8.1 Hispanic
## 3 Female 155 12.9 16.0 Hispanic
## 4 Female 1445 122.3 160.7 Hispanic
## 5 Female 14 2.5 3.6 Non-Hispanic
## 6 Female 36 6.8 8.5 Non-Hispanic
## race
## 1 Black
## 2
## 3
## 4
## 5 Asian and Pacific Islander
## 6 Asian and Pacific Islander
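The dataset undergoes several cleaning steps to ensure data quality and consistency. Column names are standardized by converting them to lowercase and replacing spaces or periods with underscores. The deaths, death_rate, and age_adjusted_death_rate columns are converted to numeric after removing thousands separators; entries that are not numbers become NA, which is the source of the coercion warnings. The sex column is standardized to contain only “Female” or “Male”.
The race_ethnicity column is then split into separate race and ethnicity columns, and the original column is removed. Two side effects are visible in the output: the pattern “Hispanic” also matches “Non-Hispanic”, so a row such as “Black Non-Hispanic” is labeled Hispanic, and rows labeled only “Hispanic” are left with an empty race value. Finally, all character columns are trimmed to eliminate extra whitespace, and the cleaned dataset is verified by displaying the first few rows.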
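Because “Hispanic” is a substring of “Non-Hispanic”, a stricter split may be worth considering. The following is a minimal sketch in base R on a few labels written in the style of the Race.Ethnicity column; it is not the code used for the results in this report. Leaving ethnicity as NA when the label does not state it, and race as NA when the label is only “Hispanic”, are assumptions made for illustration.

```r
# Illustrative labels in the style of the dataset's Race.Ethnicity column
race_ethnicity <- c("Black Non-Hispanic", "Hispanic", "White Non-Hispanic",
                    "Asian and Pacific Islander", "Not Stated/Unknown")

# Test for "Non-Hispanic" first, because the plain pattern "Hispanic" also matches it.
# Assumption: labels that mention neither are left as NA rather than guessed.
ethnicity <- ifelse(grepl("Non-Hispanic", race_ethnicity, fixed = TRUE), "Non-Hispanic",
             ifelse(grepl("Hispanic", race_ethnicity, fixed = TRUE), "Hispanic", NA))

# Strip the ethnicity label; a label that was only "Hispanic" becomes NA instead of ""
race <- trimws(gsub("Non-Hispanic|Hispanic", "", race_ethnicity))
race[race == ""] <- NA

split_labels <- data.frame(race_ethnicity, ethnicity, race)
print(split_labels)
```

With explicit NA values, the rows without a stated race remain visible in missing-value summaries instead of forming an unnamed "" group.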
#Summary Statistics
# Summary of numeric variables
summary(data$deaths)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 25.0 140.0 429.3 317.2 7050.0 138
summary(data$death_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.40 12.84 20.10 56.26 78.90 491.40 729
summary(data$age_adjusted_death_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.50 12.34 19.80 53.61 81.50 414.59 729
The summary statistics describe the three numeric variables: deaths, death_rate, and age_adjusted_death_rate. The number of deaths ranges from 1 to 7050, with a median of 140 and a mean of 429.3. The mean exceeds even the third quartile (317.2), which indicates a strongly right-skewed distribution. The death rate ranges from 2.4 to 491.4, with a median of 20.1 and a mean of 56.26, and the age-adjusted death rate ranges from 2.5 to 414.59, with a median of 19.8 and a mean of 53.61; both are right-skewed as well. The rates are missing for 729 rows and the death counts for 138 rows.
Overall, a small number of large values pulls the means well above the medians. This skewness may affect the analysis and the interpretation of results, so it is worth checking whether the extreme values are genuine (for example, high-burden causes such as heart disease) before treating them as outliers.
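One common way to make "extreme value" concrete is the 1.5 × IQR rule. The sketch below applies it to the six death counts printed by head(data) above; the choice of this rule and the tiny sample are assumptions for illustration, not part of the original analysis.

```r
# Death counts from the first six rows printed by head(data)
deaths <- c(83, 96, 155, 1445, 14, 36)

# A mean well above the median is a quick sign of right skew
print(c(mean = mean(deaths), median = median(deaths)))

# Upper fence of the 1.5 * IQR rule (an assumed convention, not the report's method)
q <- quantile(deaths, c(0.25, 0.75))
upper_fence <- unname(q[2] + 1.5 * (q[2] - q[1]))

# Values above the fence are candidates for inspection, not automatic removal
flagged <- deaths[deaths > upper_fence]
print(flagged)
```

In this small sample only the count of 1445, the Diseases of Heart row, lies above the fence, which shows why a flagged value can be a real high-burden cause rather than an error.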
#Heatmap for correlations
# Load necessary libraries
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
#Correlation Heatmap
# Compute the correlation matrix
corr_matrix <- cor(data[sapply(data, is.numeric)], use = "complete.obs")
# Melt the correlation matrix for ggplot
melted_corr_matrix <- melt(corr_matrix)
# Create a heatmap
ggplot(data = melted_corr_matrix, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)) +
coord_fixed()
The correlation analysis reveals a strong positive relationship between death_rate and age_adjusted_death_rate, which suggests the two variables carry largely redundant information. A moderate positive correlation exists between deaths and age_adjusted_death_rate, as expected, since causes with more deaths tend to have higher rates. The correlation with year is weak, implying that year alone has a limited linear association with the other variables. No strong negative relationships were observed among the variables. Because use = "complete.obs" is set, only rows with no missing numeric values enter the calculation.
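The heatmap conveys the strength of each correlation but not its uncertainty. A minimal sketch of how a single pair could be tested with cor.test() from base R's stats package follows; it uses only the six rows printed by head(data), so the numbers it prints are illustrative and say nothing about the full dataset.

```r
# death_rate and age_adjusted_death_rate from the first six rows printed by head(data)
death_rate <- c(7.9, 8.0, 12.9, 122.3, 2.5, 6.8)
age_adjusted_death_rate <- c(6.9, 8.1, 16.0, 160.7, 3.6, 8.5)

# Pearson correlation with a test of zero correlation and a 95% confidence interval
result <- cor.test(death_rate, age_adjusted_death_rate, method = "pearson")

r <- unname(result$estimate)
print(r)                 # sample correlation
print(result$conf.int)   # 95% confidence interval
print(result$p.value)    # p-value for H0: correlation = 0
```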
#Cross-Validation for Linear Regression Model
# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
library(ggplot2)
library(dplyr)
library(VIM)
## Warning: package 'VIM' was built under R version 4.4.2
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Check for missing values in each column
colSums(is.na(data))
## year leading_cause sex
## 0 0 0
## deaths death_rate age_adjusted_death_rate
## 138 729 729
## ethnicity race
## 0 0
# Visualize missing values (optional)
aggr(data, col = c('navyblue','red'), numbers = TRUE, sortVars = TRUE, labels = names(data), cex.axis = .7, gap = 3, ylab = c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## death_rate 0.34681256
## age_adjusted_death_rate 0.34681256
## deaths 0.06565176
## year 0.00000000
## leading_cause 0.00000000
## sex 0.00000000
## ethnicity 0.00000000
## race 0.00000000
# Impute missing values using median imputation
preproc <- preProcess(data, method = 'medianImpute')
data_imputed <- predict(preproc, data)
# Define the model with the imputed data
model <- train(
age_adjusted_death_rate ~ year + sex + race + leading_cause,
data = data_imputed,
method = "lm",
trControl = trainControl(method = "cv", number = 5)
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
# Print the cross-validation results
print(model)
## Linear Regression
##
## 2102 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1680, 1682, 1681, 1682, 1683
## Resampling results:
##
## RMSE Rsquared MAE
## 36.33776 0.5958909 24.99559
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The missing-value summary shows a high proportion of missing values in death_rate and age_adjusted_death_rate (729 rows each, about 34.68%) and a moderate proportion in deaths (138 rows, about 6.57%). The other variables (year, leading_cause, sex, ethnicity, and race) have no missing values.
The missing values were filled with median imputation before the model was fitted. This also imputes the outcome, age_adjusted_death_rate, for roughly a third of the rows, which assigns the same value to many observations. With 5-fold cross-validation on the 2102 imputed rows, the linear model reaches an RMSE of 36.34, an R-squared of 0.596, and an MAE of 25.00. The rank-deficient-fit warnings suggest that some predictor levels could not be estimated within individual folds. A sensitivity analysis, for example comparing the imputed results with a complete-case fit, would show how robust the conclusions are to the missing data, and investigating why the values are missing can inform the choice of handling method.
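A small sensitivity check can make the effect of median imputation visible. The sketch below compares complete cases with a median-imputed version of a short vector; the six observed values are the death rates printed by head(data), while the three NA entries are hypothetical and stand in for the missing rows.

```r
# Six observed death rates from head(data) plus three hypothetical missing entries
rates <- c(7.9, 8.0, 12.9, 122.3, 2.5, 6.8, NA, NA, NA)

# Approach 1: keep only the observed values
complete_cases <- rates[!is.na(rates)]

# Approach 2: replace each NA with the median of the observed values
imputed <- ifelse(is.na(rates), median(rates, na.rm = TRUE), rates)

# Side-by-side summary of the two approaches
comparison <- data.frame(
  approach = c("complete cases", "median imputed"),
  n = c(length(complete_cases), length(imputed)),
  mean = c(mean(complete_cases), mean(imputed)),
  sd = c(sd(complete_cases), sd(imputed))
)
print(comparison)
```

In this example the imputed vector has a smaller standard deviation than the complete cases, because every filled-in value sits at the median. The same shrinkage applies when an outcome variable is imputed before modeling.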
#Probability Calculations
# Probability of death rate greater than a threshold for a given race and year
# Load necessary library
library(dplyr)
# Define the threshold
threshold <- 50
# Calculate probabilities
probabilities <- data %>%
group_by(race, year) %>%
summarize(prob = mean(death_rate > threshold, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
# Display probabilities
print(probabilities)
## # A tibble: 90 × 3
## race year prob
## <chr> <int> <dbl>
## 1 "" 2007 0.273
## 2 "" 2008 0.273
## 3 "" 2009 0.261
## 4 "" 2010 0.273
## 5 "" 2011 0.273
## 6 "" 2012 0.273
## 7 "" 2013 0.273
## 8 "" 2014 0.273
## 9 "" 2015 0.273
## 10 "" 2016 0.273
## # ℹ 80 more rows
#Death Rate Over Years by Race
ggplot(data, aes(x=year, y=death_rate, color=race)) +
geom_line(size=1) +
geom_point(size=2, alpha=0.8) +
scale_color_brewer(palette="Set1") +
labs(
title="Death Rate Trends by Race",
subtitle="Analyzing temporal mortality trends (Year: 2000-2020)",
x="Year",
y="Death Rate (per 100,000)",
color="Race"
) +
theme_minimal(base_size=14) +
theme(
legend.position="bottom",
axis.text.x=element_text(angle=45, hjust=1),
plot.title=element_text(face="bold", size=16)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 729 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 729 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Distribution of Deaths
ggplot(data, aes(x=deaths)) +
geom_histogram(binwidth=50, fill="skyblue", color="black") +
labs(
title="Distribution of Deaths",
x="Number of Deaths",
y="Frequency"
) +
theme_minimal(base_size=14)
## Warning: Removed 138 rows containing non-finite outside the scale range
## (`stat_bin()`).
This histogram shows the distribution of the number of deaths, using a bin width of 50. Most records fall in the lowest bins, with a long right tail of high counts, which matches the skewness seen in the summary statistics. The 138 rows with missing death counts are excluded from the plot.
#Top 10 Leading Causes of Death by Age-Adjusted Death Rate
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(stringr) # Ensure stringr is loaded
# Get top 10 leading causes by average age-adjusted death rate
top_causes <- data %>%
group_by(leading_cause) %>%
summarize(avg_death_rate = mean(age_adjusted_death_rate, na.rm = TRUE)) %>%
arrange(desc(avg_death_rate)) %>%
slice_head(n = 10) %>%
pull(leading_cause)
# Filter data to include only the top 10 leading causes
filtered_data <- data %>%
filter(leading_cause %in% top_causes)
# Plot the top 10 leading causes by sex
ggplot(filtered_data, aes(x=leading_cause, y=age_adjusted_death_rate, fill=sex)) +
geom_boxplot() +
theme_minimal(base_size=14) +
theme(
axis.text.x = element_text(angle=30, hjust=1, size=10),
plot.title = element_text(size=16, face="bold", hjust=0.5),
legend.position = "top"
) +
labs(
title = "Top 10 Leading Causes of Age-Adjusted Death Rate by Gender",
x = "Leading Cause",
y = "Age-Adjusted Death Rate"
) +
scale_x_discrete(labels = function(x) str_wrap(x, width=10)) + # Wrap text for readability
scale_fill_brewer(palette="Set2")
## Warning: Removed 353 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot compares age-adjusted death rates by sex for the 10 leading causes with the highest average age-adjusted death rate. It shows the central tendency and spread of the rates for each cause and makes differences between females and males visible. The 353 rows with missing rates are excluded from the plot.
#Facet Grids to Visualize Mortality Trends by Demographics
# Load necessary library
library(ggplot2)
# Create a FacetGrid for the death rate by year and race
ggplot(data, aes(x=year, y=death_rate, color=race)) +
geom_line(size=1) +
geom_point(size=2, alpha=0.8) +
facet_wrap(~sex) +
labs(
title="Death Rate Trends by Year, Race, and Sex",
x="Year",
y="Death Rate (per 100,000)",
color="Race"
) +
theme_minimal() +
theme(
legend.position="bottom",
axis.text.x=element_text(angle=45, hjust=1),
plot.title=element_text(face="bold")
)
## Warning: Removed 578 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 729 rows containing missing values or values outside the scale range
## (`geom_point()`).
The facet grid plot displays death rate trends by year, race, and sex, offering a more granular view of mortality trends. Each facet represents a different sex, allowing for easy comparison between males and females. This visualization helps uncover whether there are consistent patterns or differences in death rates based on race and sex over time.
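The dataset holds one row per year, cause, sex, and race group, so a line drawn directly through the raw rows connects several causes within the same year. One option is to average the rates per year, race, and sex before plotting. The sketch below does this with base R's aggregate() on a hypothetical toy data frame; the values are illustrative and the use of the mean as the summary is an assumption.

```r
# Hypothetical toy rows: two causes per year for a single sex and race group
toy <- data.frame(
  year = c(2008, 2008, 2009, 2009),
  sex = "Female",
  race = "Asian and Pacific Islander",
  death_rate = c(6.8, 2.5, 4.0, 6.0)
)

# One mean death rate per year, race, and sex.
# The formula interface drops rows with missing death_rate by default.
trend <- aggregate(death_rate ~ year + race + sex, data = toy, FUN = mean)
print(trend)
```

The aggregated frame could then be passed to the same ggplot() call, using geom_line(linewidth = 1) in place of size = 1, which the warning above reports as deprecated for lines since ggplot2 3.4.0.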
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr) # Include tidyr for drop_na() function
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(car) # For VIF (Variance Inflation Factor) check
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
# Load the data
data <- read.csv("https://raw.githubusercontent.com/Jomifum/ProjectproposalD606/main/New_York_City_Leading_Causes_of_Death_20241016.csv")
# Data Cleaning
colnames(data) <- tolower(gsub("\\s+|\\.", "_", trimws(colnames(data))))
data$deaths <- as.numeric(gsub(",", "", data$deaths))
## Warning: NAs introduced by coercion
data$death_rate <- as.numeric(gsub(",", "", data$death_rate))
## Warning: NAs introduced by coercion
data$age_adjusted_death_rate <- as.numeric(gsub(",", "", data$age_adjusted_death_rate))
## Warning: NAs introduced by coercion
# Ensure all necessary fields are present and correct
data$sex <- ifelse(data$sex %in% c("F", "Female"), "Female", "Male")
data <- data %>%
mutate(
ethnicity = ifelse(grepl("Hispanic", race_ethnicity), "Hispanic", "Non-Hispanic"),
race = str_trim(gsub("Hispanic|Non-Hispanic", "", race_ethnicity)),
year = as.numeric(year)
) %>%
select(-race_ethnicity)
# Remove leading/trailing whitespace from character columns
char_columns <- sapply(data, is.character)
data[char_columns] <- lapply(data[char_columns], trimws)
# Encode categorical variables as factors and ensure consistent levels
data$sex <- factor(data$sex)
data$race <- factor(data$race, levels = c("Black", "White", "Asian and Pacific Islander", "Not Stated/Unknown", "Other Race/ Ethnicity"))
data$leading_cause <- factor(data$leading_cause)
# Handle NAs by removing rows with NAs for the analysis
data <- data %>% drop_na()
# Check structure of the cleaned data
str(data)
## 'data.frame': 1042 obs. of 8 variables:
## $ year : num 2011 2009 2008 2012 2009 ...
## $ leading_cause : Factor w/ 42 levels "Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",..: 35 4 2 2 17 2 10 2 35 35 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 1 2 1 1 1 ...
## $ deaths : num 83 14 36 286 371 42 267 177 90 13 ...
## $ death_rate : num 7.9 2.5 6.8 21.4 27.6 6.7 20 12.5 8.6 2.5 ...
## $ age_adjusted_death_rate: num 6.9 3.6 8.5 18.8 23.3 6.9 16.7 8.5 7.1 3.2 ...
## $ ethnicity : chr "Hispanic" "Non-Hispanic" "Non-Hispanic" "Hispanic" ...
## $ race : Factor w/ 5 levels "Black","White",..: 1 3 3 2 2 3 2 2 1 3 ...
# Linear Regression Model
model <- lm(age_adjusted_death_rate ~ year + sex + race + leading_cause, data = data)
# Summary of the regression model
summary(model)
##
## Call:
## lm(formula = age_adjusted_death_rate ~ year + sex + race + leading_cause,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.120 -14.302 -0.582 10.407 211.390
##
## Coefficients:
## Estimate
## (Intercept) 1513.8874
## year -0.7413
## sexMale 22.6811
## raceWhite -13.9842
## raceAsian and Pacific Islander -36.8871
## raceNot Stated/Unknown 6.2964
## leading_causeAccidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) -3.9109
## leading_causeAll Other Causes 99.6937
## leading_causeAlzheimer's Disease (G30) 2.4936
## leading_causeAssault (Homicide: U01-U02, Y87.1, X85-Y09) -16.8565
## leading_causeAssault (Homicide: Y87.1, X85-Y09) -17.5467
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 3.5748
## leading_causeCertain Conditions originating in the Perinatal Period (P00-P96) -13.4183
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) -20.3811
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73) -4.9023
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 2.4641
## leading_causeCongenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99) -8.1693
## leading_causeCovid-19 121.4490
## leading_causeDiabetes Mellitus (E10-E14) 5.4614
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 177.5685
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) -4.1478
## leading_causeEssential Hypertension and Renal Diseases (I10, I12, I15) 3.4907
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) -12.0540
## leading_causeInfluenza (Flu) and Pneumonia (J09-J18) 7.0873
## leading_causeIntentional Self-Harm (Suicide: U03, X60-X84, Y87.0) 2.8029
## leading_causeIntentional Self-Harm (Suicide: X60-X84, Y87.0) -1.5811
## leading_causeMalignant Neoplasms (Cancer: C00-C97) 129.7270
## leading_causeMental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44) -5.9926
## leading_causeMental and Behavioral Disorders due to Use of Alcohol (F10) -25.1839
## leading_causeNephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) -6.8793
## leading_causeParkinson's Disease (G20) 1.1081
## leading_causeSepticemia (A40-A41) -2.2329
## leading_causeViral Hepatitis (B15-B19) 17.7182
## Std. Error
## (Intercept) 441.7131
## year 0.2189
## sexMale 1.7484
## raceWhite 2.1965
## raceAsian and Pacific Islander 2.2335
## raceNot Stated/Unknown 4.7195
## leading_causeAccidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) 6.3805
## leading_causeAll Other Causes 5.3475
## leading_causeAlzheimer's Disease (G30) 6.1537
## leading_causeAssault (Homicide: U01-U02, Y87.1, X85-Y09) 12.0804
## leading_causeAssault (Homicide: Y87.1, X85-Y09) 10.3539
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 5.3576
## leading_causeCertain Conditions originating in the Perinatal Period (P00-P96) 12.2800
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 16.8232
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73) 12.1302
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 5.3673
## leading_causeCongenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99) 27.6383
## leading_causeCovid-19 9.0531
## leading_causeDiabetes Mellitus (E10-E14) 5.3580
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 5.3475
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 5.5129
## leading_causeEssential Hypertension and Renal Diseases (I10, I12, I15) 9.0531
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 7.8330
## leading_causeInfluenza (Flu) and Pneumonia (J09-J18) 5.3475
## leading_causeIntentional Self-Harm (Suicide: U03, X60-X84, Y87.0) 8.3307
## leading_causeIntentional Self-Harm (Suicide: X60-X84, Y87.0) 7.1388
## leading_causeMalignant Neoplasms (Cancer: C00-C97) 5.3475
## leading_causeMental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44) 6.1934
## leading_causeMental and Behavioral Disorders due to Use of Alcohol (F10) 27.8372
## leading_causeNephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) 8.1439
## leading_causeParkinson's Disease (G20) 19.7644
## leading_causeSepticemia (A40-A41) 11.3338
## leading_causeViral Hepatitis (B15-B19) 27.5843
## t value
## (Intercept) 3.427
## year -3.387
## sexMale 12.973
## raceWhite -6.367
## raceAsian and Pacific Islander -16.516
## raceNot Stated/Unknown 1.334
## leading_causeAccidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) -0.613
## leading_causeAll Other Causes 18.643
## leading_causeAlzheimer's Disease (G30) 0.405
## leading_causeAssault (Homicide: U01-U02, Y87.1, X85-Y09) -1.395
## leading_causeAssault (Homicide: Y87.1, X85-Y09) -1.695
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 0.667
## leading_causeCertain Conditions originating in the Perinatal Period (P00-P96) -1.093
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) -1.211
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73) -0.404
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 0.459
## leading_causeCongenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99) -0.296
## leading_causeCovid-19 13.415
## leading_causeDiabetes Mellitus (E10-E14) 1.019
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) 33.206
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) -0.752
## leading_causeEssential Hypertension and Renal Diseases (I10, I12, I15) 0.386
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) -1.539
## leading_causeInfluenza (Flu) and Pneumonia (J09-J18) 1.325
## leading_causeIntentional Self-Harm (Suicide: U03, X60-X84, Y87.0) 0.336
## leading_causeIntentional Self-Harm (Suicide: X60-X84, Y87.0) -0.221
## leading_causeMalignant Neoplasms (Cancer: C00-C97) 24.260
## leading_causeMental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44) -0.968
## leading_causeMental and Behavioral Disorders due to Use of Alcohol (F10) -0.905
## leading_causeNephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) -0.845
## leading_causeParkinson's Disease (G20) 0.056
## leading_causeSepticemia (A40-A41) -0.197
## leading_causeViral Hepatitis (B15-B19) 0.642
## Pr(>|t|)
## (Intercept) 0.000634
## year 0.000735
## sexMale < 2e-16
## raceWhite 2.93e-10
## raceAsian and Pacific Islander < 2e-16
## raceNot Stated/Unknown 0.182462
## leading_causeAccidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) 0.540043
## leading_causeAll Other Causes < 2e-16
## leading_causeAlzheimer's Disease (G30) 0.685408
## leading_causeAssault (Homicide: U01-U02, Y87.1, X85-Y09) 0.163214
## leading_causeAssault (Homicide: Y87.1, X85-Y09) 0.090441
## leading_causeCerebrovascular Disease (Stroke: I60-I69) 0.504776
## leading_causeCertain Conditions originating in the Perinatal Period (P00-P96) 0.274787
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74) 0.225990
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73) 0.686196
## leading_causeChronic Lower Respiratory Diseases (J40-J47) 0.646266
## leading_causeCongenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99) 0.767613
## leading_causeCovid-19 < 2e-16
## leading_causeDiabetes Mellitus (E10-E14) 0.308304
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) < 2e-16
## leading_causeEssential Hypertension and Renal Diseases (I10, I12) 0.451996
## leading_causeEssential Hypertension and Renal Diseases (I10, I12, I15) 0.699889
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24) 0.124150
## leading_causeInfluenza (Flu) and Pneumonia (J09-J18) 0.185352
## leading_causeIntentional Self-Harm (Suicide: U03, X60-X84, Y87.0) 0.736599
## leading_causeIntentional Self-Harm (Suicide: X60-X84, Y87.0) 0.824759
## leading_causeMalignant Neoplasms (Cancer: C00-C97) < 2e-16
## leading_causeMental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44) 0.333486
## leading_causeMental and Behavioral Disorders due to Use of Alcohol (F10) 0.365848
## leading_causeNephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) 0.398470
## leading_causeParkinson's Disease (G20) 0.955300
## leading_causeSepticemia (A40-A41) 0.843860
## leading_causeViral Hepatitis (B15-B19) 0.520805
##
## (Intercept) ***
## year ***
## sexMale ***
## raceWhite ***
## raceAsian and Pacific Islander ***
## raceNot Stated/Unknown
## leading_causeAccidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## leading_causeAll Other Causes ***
## leading_causeAlzheimer's Disease (G30)
## leading_causeAssault (Homicide: U01-U02, Y87.1, X85-Y09)
## leading_causeAssault (Homicide: Y87.1, X85-Y09) .
## leading_causeCerebrovascular Disease (Stroke: I60-I69)
## leading_causeCertain Conditions originating in the Perinatal Period (P00-P96)
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73-K74)
## leading_causeChronic Liver Disease and Cirrhosis (K70, K73)
## leading_causeChronic Lower Respiratory Diseases (J40-J47)
## leading_causeCongenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)
## leading_causeCovid-19 ***
## leading_causeDiabetes Mellitus (E10-E14)
## leading_causeDiseases of Heart (I00-I09, I11, I13, I20-I51) ***
## leading_causeEssential Hypertension and Renal Diseases (I10, I12)
## leading_causeEssential Hypertension and Renal Diseases (I10, I12, I15)
## leading_causeHuman Immunodeficiency Virus Disease (HIV: B20-B24)
## leading_causeInfluenza (Flu) and Pneumonia (J09-J18)
## leading_causeIntentional Self-Harm (Suicide: U03, X60-X84, Y87.0)
## leading_causeIntentional Self-Harm (Suicide: X60-X84, Y87.0)
## leading_causeMalignant Neoplasms (Cancer: C00-C97) ***
## leading_causeMental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)
## leading_causeMental and Behavioral Disorders due to Use of Alcohol (F10)
## leading_causeNephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## leading_causeParkinson's Disease (G20)
## leading_causeSepticemia (A40-A41)
## leading_causeViral Hepatitis (B15-B19)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.15 on 1009 degrees of freedom
## Multiple R-squared: 0.8575, Adjusted R-squared: 0.853
## F-statistic: 189.8 on 32 and 1009 DF, p-value: < 2.2e-16
# Check for multicollinearity
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## year 1.270614 1 1.127215
## sex 1.079965 1 1.039213
## race 1.441377 3 1.062828
## leading_cause 1.807820 27 1.011026
# Residual diagnostics
par(mfrow = c(2, 2)) # Display diagnostic plots
plot(model)
## Warning: not plotting observations with leverage one:
## 59, 374, 862
# Predicted vs Actual Values
predicted <- predict(model, newdata = data)
ggplot(data, aes(x = predicted, y = age_adjusted_death_rate)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Predicted vs Actual Age-Adjusted Death Rates",
x = "Predicted Values",
y = "Actual Values"
) +
theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
# Plot Residuals
data$residuals <- residuals(model)
ggplot(data, aes(x = predicted, y = residuals)) +
geom_point(color = "purple") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residuals vs Predicted Values",
x = "Predicted Values",
y = "Residuals"
) +
theme_minimal(base_size = 14)
# Residuals histogram
ggplot(data, aes(x = residuals)) +
geom_histogram(binwidth = 5, fill = "green", color = "black") +
labs(
title = "Histogram of Residuals",
x = "Residuals",
y = "Frequency"
) +
theme_minimal(base_size = 14)
Interpretation: The residuals range from -86.1 to 211.4 with a median of -0.58, so the model underpredicts some observations by a wide margin. The coefficients point to several trends: holding the other predictors fixed, the age-adjusted death rate falls by about 0.74 per year and is about 22.7 higher for males than for females. Relative to the reference category (Black), the rate is about 14.0 lower for the White group and about 36.9 lower for the Asian and Pacific Islander group. Among the causes of death, the statistically significant coefficients are all positive: Diseases of Heart (177.6), Malignant Neoplasms (129.7), Covid-19 (121.4), and All Other Causes (99.7). The two “Intentional Self-Harm (Suicide)” categories are not significant (p = 0.74 and p = 0.82). Because rows labeled only “Hispanic” end up with an empty race value that is not among the factor levels, drop_na() removes them, and the model is fitted on the remaining 1042 observations.
The model fit is strong, with a Multiple R-squared of 0.8575, meaning that 85.75% of the variance in the age-adjusted death rate is explained by the predictors. The Adjusted R-squared of 0.853 and the F-statistic (189.8 on 32 and 1009 degrees of freedom, p < 2.2e-16) indicate that the predictors are jointly significant. The GVIF values are all close to 1, so multicollinearity does not appear to be a concern. The diagnostic plots point to heteroscedasticity and possible influential observations (three observations have leverage one), so outliers or high-leverage points may affect the estimates. In the scatterplot of predicted versus actual values, the points cluster around the diagonal line.
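To show how the coefficients combine into a fitted value, the sketch below reproduces the arithmetic for one profile using the estimates printed by summary(model). The year 2011 is an arbitrary example, and the coefficients are rounded as printed, so the results are approximate.

```r
# Coefficients copied from the summary(model) output (rounded as printed)
intercept <- 1513.8874
b_year    <- -0.7413
b_male    <- 22.6811

# Fitted age-adjusted death rate for the reference categories
# (Female, Black, first leading_cause level) in the example year 2011
pred_female_2011 <- intercept + b_year * 2011

# Same profile for males: add the sexMale coefficient
pred_male_2011 <- pred_female_2011 + b_male

print(c(female = pred_female_2011, male = pred_male_2011))
```

This gives roughly 23.1 for females and 45.8 for males. For any other race or cause, the corresponding coefficient from the table is added in the same way.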
#Conclusions
This project investigates mortality trends in New York City, focusing on temporal changes and leading causes of death. The analysis highlights variation in age-adjusted death rates over the years, pointing to the role of temporal factors in mortality. Specific leading causes of death, notably heart disease and cancer, are associated with much higher rates than the rest, suggesting public health challenges that require ongoing attention. The linear regression model provides statistical evidence for these relationships, although the diagnostic plots indicate heteroscedasticity and possible influential observations, so the estimates should be read with some caution.
Visualizations such as the correlation heatmap and facet grids made the data easier to interpret and revealed key trends and patterns. These findings emphasize the importance of considering temporal factors in public health strategies, enabling policymakers to design more effective interventions to improve health outcomes across New York City. Future research could build on this foundation by incorporating additional datasets, exploring non-linear models, and conducting more granular analyses to further enhance our understanding of mortality trends.