For my final project, I will be analyzing data on maternal mortality specifically NYC counts and global rates. The motivation for my project is my background in public health from my BS, MPH, and current job - maternal mortality as we will see later, are prominent in developing and non-developed countries. Furthermore, I was born and raised in NYC, and it is unfortunate that we have very high rates of maternal mortality. Disparities in maternal mortality dive deeper, since mortality counts differ by race and ethnicity with POC having higher counts.
Using the NYC DOHMH “Pregnancy-Associated Mortality,” I will create some visualization of the data, then do an ANOVA for difference in race/ethnicity and boroughs in NYC, as well as create a table summary. Then I will look at the global “Maternal Mortality Dataset,” original taken from the Human Development Reports which looks at maternal mortality by country to create some visualizations of the data to compare rates across different continents. I lastly compared NYC rates with the continents from the global dataset in a bar graph.
Final Project:
* Data sources:
* NYC Dataset: https://data.cityofnewyork.us/Health/Pregnancy-Associated-Mortality/27x4-cbi6/data
* Global Dataset: https://www.kaggle.com/datasets/iamsouravbanerjee/maternal-mortality-dataset?resource=download&select=Maternal+Mortality.csv
* Data acquired: NYC (API) and Global (csv.)
* Transformations: Tidy (pivot longer)
* Clean-up are performed: Remove NAs and other rows for analysis
* Analysis and Presentation: Bar plots, summary statistics, ANOVA
* Feature not covered in class: gt_summary, flexable
* Challenge: tbl_summary (Recreate NYC Maternal Mortality Annual
Report)
Use jsonline (fromJSON) to retrieve the API, import the NYC maternal mortality data, then change it to a dataset.
url <- "https://data.cityofnewyork.us/resource/27x4-cbi6.json"
response <- GET(url)
data <- fromJSON(content(response, "text"))
datatable(data)
mm_world <- read_excel("C:/Users/Kesha/Desktop/607/maternal_mortality.xlsx")
datatable(mm_world)
For the global dataset, I used a pivot_longer since the dataset contained variables of maternal mortailty ratios by year. Then I removed NAs and only kept year 2017 to 2021 to match the NYC dataset.
mmworld_long <- mm_world %>%
pivot_longer(cols = starts_with("Maternal Mortality Ratio"),
names_to = c("Year"),
names_pattern = ".*\\((\\d{4})\\)$",
values_to = "Ratio")
datatable(mmworld_long)
mm_world_1 <- mmworld_long %>%
clean_names()%>%
filter(!is.na(ratio)) %>%
filter(year >= 2016 & year <= 2021)
Here I did an aggregate mean of the continents, as well as a box plot and bar graph by continent
aggregate(ratio ~ continent, data = mm_world_1, FUN = mean)
## continent ratio
## 1 Africa 414.651235
## 2 America 79.015152
## 3 Asia 75.399306
## 4 Europe 7.803419
## 5 Oceania 64.766667
ggplot(mm_world_1, aes(x = continent, y = ratio, fill = continent)) +
geom_boxplot() +
labs(title = "Rate of Maternal Mortality per 100,000 by Continent", x = "Continent", y = "Rate per 100,000")
ggplot(mm_world_1, aes(x = year, y = ratio, fill = continent)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Maternal Mortality Rates by Year and Continent",
x = "MM Year",
y = "MM Ratio per 100,000",
color = "Continent") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(mmworld_long, aes(x = Year, y = Ratio, fill = Continent)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Paired") +
labs(title = "Maternal Mortality Rates by Year and Continent",
x = "MM Year",
y = "MM Ratio per 100,000",
color = "Continent") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 352 rows containing missing values or values outside the scale range
## (`geom_bar()`).
Remove NAs
data$deaths <- as.numeric(data$deaths)
data <- data %>%
filter(year != "Year")
Here I am recreating the NYC Maternal mortality Annual Report but using data from 2021. Then I am using all the data from the data set and creating a table summary by year to export into a word document. Similar to what you would create as a table 1 for a journal article.
nyc_data2 <- data %>%
filter(year== "2021")
count_data <- uncount(nyc_data2, deaths)
count_data2 <- uncount(data, deaths)
datatable(count_data2)
count_data %>%
select(related, race_ethnicity, borough, underlying_cause) %>%
tbl_summary(
statistic = all_categorical() ~ "{n} ({p}%)",
label = list(related ~ "Pregnancy-relatedness",
race_ethnicity ~ "Race/ethnicity",
borough ~ "Borough of residence",
underlying_cause ~ "Underlying Cause of Death"))
Characteristic | N = 2281 |
---|---|
Pregnancy-relatedness | |
All | 116 (51%) |
Pregnancy-associated but not related | 8 (3.5%) |
Pregnancy-related | 89 (39%) |
Unable to Determine | 15 (6.6%) |
Race/ethnicity | |
All | 140 (61%) |
Asian/Pacific Islander | 12 (5.3%) |
Black non-Latina | 34 (15%) |
Latina | 32 (14%) |
Other | 2 (0.9%) |
White non-Latina | 8 (3.5%) |
Borough of residence | |
All | 140 (61%) |
Bronx | 21 (9.2%) |
Brooklyn | 23 (10%) |
Manhattan | 15 (6.6%) |
Queens | 14 (6.1%) |
Rest of State | 11 (4.8%) |
Staten Island | 4 (1.8%) |
Underlying Cause of Death | |
All | 176 (77%) |
Asthma/ pulmonary conditions | 1 (0.4%) |
Cancer | 5 (2.2%) |
Cardiovascular Conditions | 2 (0.9%) |
Embolism | 2 (0.9%) |
Hemorrhage | 4 (1.8%) |
Infection/Sepsis | 5 (2.2%) |
Mental Health Conditions (Overdose related to substance use disorder) | 20 (8.8%) |
Mental Health Conditions (Suicide) | 1 (0.4%) |
Other | 10 (4.4%) |
Unknown COD | 2 (0.9%) |
1 n (%) |
table1 <- count_data2 %>%
select(year, related, race_ethnicity, borough, underlying_cause) %>%
tbl_summary(by = year,
statistic = all_categorical() ~ "{n} ({p}%)",
label = list(related ~ "Pregnancy-relatedness",
race_ethnicity ~ "Race/ethnicity",
borough ~ "Borough of residence",
underlying_cause ~ "Underlying Cause of Death"))
flex <- table1 %>%
as_flex_table() %>%
theme_box()
read_docx() %>%
body_add_flextable(flex) %>%
body_end_section_landscape() %>%
print("flex.docx")
This analysis aims to examine the relationship between race and ethnicity and deaths in NYC using data from 2016-2021.
Research Question: Is there a difference in death means for those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina?
H0: There is no difference in death means for those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.
HA: There is a difference in death means for at least one of the race/ethnicity groups (Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.
analysis <- data %>%
filter(race_ethnicity != "All")
analysis%>%
group_by(race_ethnicity) %>%
summarise(total_deaths = sum(deaths))
## # A tibble: 5 × 2
## race_ethnicity total_deaths
## <chr> <dbl>
## 1 Asian/Pacific Islander 42
## 2 Black non-Latina 193
## 3 Latina 129
## 4 Other 7
## 5 White non-Latina 72
analysis %>%
group_by(race_ethnicity) %>%
summarise(
n = sum(deaths),
mean_deaths = mean(deaths),
median_deaths = median(deaths),
sd_deaths = sd(deaths),
min_deaths = min(deaths),
max_deaths = max(deaths)
)
## # A tibble: 5 × 7
## race_ethnicity n mean_deaths median_deaths sd_deaths min_deaths max_deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asian/Pacific… 42 4.2 4.5 1.81 1 7
## 2 Black non-Lat… 193 16.1 14 6.29 6 26
## 3 Latina 129 10.8 9 5.79 2 21
## 4 Other 7 1.75 1.5 0.957 1 3
## 5 White non-Lat… 72 6 5.5 3.44 1 11
hist(analysis$deaths,
main = "Maternal Mortality Rates Distribution",
xlab = "Maternal Mortality Rates",
col = "darkgreen")
summary(analysis$deaths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 7.00 8.86 11.75 26.00
ggplot(analysis, aes(x = race_ethnicity, y = deaths, fill = race_ethnicity)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set4") +
labs(title = "Maternal Mortality Rates by Racial or Ethnic Group",
x = "Race/Ethnicity",
y = "Maternal Mortality Rates per 100,000 individuals") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Unknown palette: "Set4"
aov_result <- aov(deaths ~ race_ethnicity, data = analysis)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## race_ethnicity 4 1186.5 296.63 13.82 1.97e-07 ***
## Residuals 45 965.5 21.46
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residuals from the model
residuals1 <- resid(aov_result)
# Histogram to check normality
hist(residuals1, main = "Histogram of Residuals",
xlab = "Residuals", col = "pink", border = "black",
breaks = 30)
# Normal probability plot to check normality
qqnorm(residuals1)
The p-value is 1.97e-07.
Since the p-value (1.97e-07) is less than α= 0.05, we reject the null hypothesis. There is a statistically significant difference in death means between those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.
This analysis aims to examine the relationship between borough and deaths in NYC using data from 2016-2021.
Research Question: Is there a difference in death means for those who live in Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island?
H0: There is no difference in death means for those who live in Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.
HA: There is a difference in death means for at least one of the boroughs (Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.)
analysis3 <- data %>%
filter(borough != "All")
analysis4 <- analysis3 %>%
group_by(borough) %>%
summarise(total_deaths = sum(deaths))
analysis3 %>%
group_by(borough) %>%
summarise(
n = sum(deaths),
mean_deaths = mean(deaths),
median_deaths = median(deaths),
sd_deaths = sd(deaths),
min_deaths = min(deaths),
max_deaths = max(deaths)
)
## # A tibble: 6 × 7
## borough n mean_deaths median_deaths sd_deaths min_deaths max_deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bronx 101 8.42 8 4.36 1 16
## 2 Brooklyn 137 11.4 11 4.32 5 20
## 3 Manhattan 50 4.17 4 2.37 1 10
## 4 Queens 87 7.25 8 3.77 0 13
## 5 Rest of State 40 3.33 3 2.02 0 7
## 6 Staten Island 28 2.55 2 2.21 0 8
ggplot(analysis3, aes(x = borough, y = deaths, fill = borough)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set4") +
labs(title = "Maternal Mortality Rates by NYC Boroughs",
x = "Boroughs",
y = "Maternal Mortality Rates per 100,000 individuals") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Unknown palette: "Set4"
aov_result1 <- aov(deaths ~ borough, data = analysis3)
summary(aov_result1)
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 5 693.8 138.76 12.44 1.77e-08 ***
## Residuals 65 725.1 11.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residuals from the model
residuals <- resid(aov_result1)
# Histogram to check normality
hist(residuals, main = "Histogram of Residuals",
xlab = "Residuals", col = "coral", border = "black",
breaks = 30)
# Normal probability plot to check normality
qqnorm(residuals)
The p-value is 1.77e-08.
Since the p-value (1.77e-08) is less than α= 0.05, we reject the null hypothesis. There is a statistically significant difference in death means for at least one of the boroughs (Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.)
data_combine <- data %>%
select(year, deaths) %>%
rename(ratio = deaths)
data_combine$location <- "NYC"
mm_world_combine <- mm_world_1 %>%
select(year, continent, ratio)%>%
rename(location = continent)
stacked_data <- bind_rows(data_combine, mm_world_combine)
datatable(stacked_data)
ggplot(stacked_data, aes(x = year, y = ratio, fill = location)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Maternal Mortality Rates by Year and Location",
x = "MM Year",
y = "MM Ratio per 100,000",
color = "Location") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Using the NYC Maternal Mortality Dataset and the Global Maternal Mortality Dataset, retrieved using API and as a csv. I was able to use techniques of tidying data to perform visualizations and analysis. I also used gt_summary as a new feature to share with the class. Although this feature took me a little while to figure out and reproduce similar to the original, I found a work around that produced the results I wanted to. My statistical analysis showed that both borough and race/ethnicity has statistically significant difference in means, although my plots and residual plots showed that I did not meet the assumptions of ANOVA - therefore my data is not reliable. My last stacked dataset of NYC MM rates and Global MM rates by continent shows that NYC has very close rates to all of Europe.