Data 607: Final Project (Maternal Mortality)

Introduction

For my final project, I will be analyzing data on maternal mortality specifically NYC counts and global rates. The motivation for my project is my background in public health from my BS, MPH, and current job - maternal mortality as we will see later, are prominent in developing and non-developed countries. Furthermore, I was born and raised in NYC, and it is unfortunate that we have very high rates of maternal mortality. Disparities in maternal mortality dive deeper, since mortality counts differ by race and ethnicity with POC having higher counts.

Using the NYC DOHMH “Pregnancy-Associated Mortality,” I will create some visualization of the data, then do an ANOVA for difference in race/ethnicity and boroughs in NYC, as well as create a table summary. Then I will look at the global “Maternal Mortality Dataset,” original taken from the Human Development Reports which looks at maternal mortality by country to create some visualizations of the data to compare rates across different continents. I lastly compared NYC rates with the continents from the global dataset in a bar graph.

Final Project:
* Data sources:
* NYC Dataset: https://data.cityofnewyork.us/Health/Pregnancy-Associated-Mortality/27x4-cbi6/data
* Global Dataset: https://www.kaggle.com/datasets/iamsouravbanerjee/maternal-mortality-dataset?resource=download&select=Maternal+Mortality.csv
* Data acquired: NYC (API) and Global (csv.)
* Transformations: Tidy (pivot longer)
* Clean-up are performed: Remove NAs and other rows for analysis
* Analysis and Presentation: Bar plots, summary statistics, ANOVA
* Feature not covered in class: gt_summary, flexable
* Challenge: tbl_summary (Recreate NYC Maternal Mortality Annual Report)

Acquiring Data Sources

NYC Maternal Mortality Data 2016-2021

Use jsonline (fromJSON) to retrieve the API, import the NYC maternal mortality data, then change it to a dataset.

url <- "https://data.cityofnewyork.us/resource/27x4-cbi6.json"
response <- GET(url)

data <- fromJSON(content(response, "text"))

datatable(data)

Global Maternal Mortality Data 1990-2021

mm_world <- read_excel("C:/Users/Kesha/Desktop/607/maternal_mortality.xlsx")

datatable(mm_world)

Tidying and cleaning data source

For the global dataset, I used a pivot_longer since the dataset contained variables of maternal mortailty ratios by year. Then I removed NAs and only kept year 2017 to 2021 to match the NYC dataset.

mmworld_long <- mm_world %>%
  pivot_longer(cols = starts_with("Maternal Mortality Ratio"),
               names_to = c("Year"),
               names_pattern = ".*\\((\\d{4})\\)$",
               values_to = "Ratio")

datatable(mmworld_long)

mm_world_1 <- mmworld_long %>%
  clean_names()%>%
  filter(!is.na(ratio)) %>%
  filter(year >= 2016 & year <= 2021)

World Maternal Mortality Analysis and visualization

Here I did an aggregate mean of the continents, as well as a box plot and bar graph by continent

aggregate(ratio ~ continent, data = mm_world_1, FUN = mean)

##   continent      ratio
## 1    Africa 414.651235
## 2   America  79.015152
## 3      Asia  75.399306
## 4    Europe   7.803419
## 5   Oceania  64.766667

ggplot(mm_world_1, aes(x = continent, y = ratio, fill = continent)) +
  geom_boxplot() +
  labs(title = "Rate of Maternal Mortality per 100,000 by Continent", x = "Continent", y = "Rate per 100,000")

ggplot(mm_world_1, aes(x = year, y = ratio, fill = continent)) +
   geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Maternal Mortality Rates by Year and Continent",
       x =  "MM Year",
       y = "MM Ratio per 100,000",
       color = "Continent") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(mmworld_long, aes(x = Year, y = Ratio, fill = Continent)) +
   geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Maternal Mortality Rates by Year and Continent",
       x =  "MM Year",
       y = "MM Ratio per 100,000",
       color = "Continent") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 352 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Cleaning NYC Maternal Mortality Data 2017-2021

Remove NAs

data$deaths <- as.numeric(data$deaths)
data <- data %>%
    filter(year != "Year")

Summarize NYC Data (New feature and challenge)

Here I am recreating the NYC Maternal mortality Annual Report but using data from 2021. Then I am using all the data from the data set and creating a table summary by year to export into a word document. Similar to what you would create as a table 1 for a journal article.

nyc_data2 <- data %>%
    filter(year== "2021")

count_data <- uncount(nyc_data2, deaths)
count_data2 <- uncount(data, deaths)

datatable(count_data2)

count_data %>%
  select(related, race_ethnicity, borough, underlying_cause) %>%
  tbl_summary(
              statistic = all_categorical() ~ "{n} ({p}%)",
              label = list(related ~ "Pregnancy-relatedness",
                           race_ethnicity ~ "Race/ethnicity",
                           borough ~ "Borough of residence",
                           underlying_cause ~ "Underlying Cause of Death"))

Characteristic	N = 228¹
Pregnancy-relatedness
All	116 (51%)
Pregnancy-associated but not related	8 (3.5%)
Pregnancy-related	89 (39%)
Unable to Determine	15 (6.6%)
Race/ethnicity
All	140 (61%)
Asian/Pacific Islander	12 (5.3%)
Black non-Latina	34 (15%)
Latina	32 (14%)
Other	2 (0.9%)
White non-Latina	8 (3.5%)
Borough of residence
All	140 (61%)
Bronx	21 (9.2%)
Brooklyn	23 (10%)
Manhattan	15 (6.6%)
Queens	14 (6.1%)
Rest of State	11 (4.8%)
Staten Island	4 (1.8%)
Underlying Cause of Death
All	176 (77%)
Asthma/ pulmonary conditions	1 (0.4%)
Cancer	5 (2.2%)
Cardiovascular Conditions	2 (0.9%)
Embolism	2 (0.9%)
Hemorrhage	4 (1.8%)
Infection/Sepsis	5 (2.2%)
Mental Health Conditions (Overdose related to substance use disorder)	20 (8.8%)
Mental Health Conditions (Suicide)	1 (0.4%)
Other	10 (4.4%)
Unknown COD	2 (0.9%)
¹ n (%)

table1 <- count_data2 %>%
  select(year, related, race_ethnicity, borough, underlying_cause) %>%
  tbl_summary(by = year,
              statistic = all_categorical() ~ "{n} ({p}%)",
              label = list(related ~ "Pregnancy-relatedness",
                           race_ethnicity ~ "Race/ethnicity",
                           borough ~ "Borough of residence",
                           underlying_cause ~ "Underlying Cause of Death"))

flex <- table1 %>%
  as_flex_table() %>%
  theme_box()

read_docx() %>%
  body_add_flextable(flex) %>%
  body_end_section_landscape() %>%
  print("flex.docx")

Statistical Analysis on NYC Data (Race and Ethnicity ANOVA)

This analysis aims to examine the relationship between race and ethnicity and deaths in NYC using data from 2016-2021.

Research Question: Is there a difference in death means for those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina?

H0: There is no difference in death means for those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.

HA: There is a difference in death means for at least one of the race/ethnicity groups (Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.

analysis <- data %>%
  filter(race_ethnicity != "All")

analysis%>%
  group_by(race_ethnicity) %>%
  summarise(total_deaths = sum(deaths))

## # A tibble: 5 × 2
##   race_ethnicity         total_deaths
##   <chr>                         <dbl>
## 1 Asian/Pacific Islander           42
## 2 Black non-Latina                193
## 3 Latina                          129
## 4 Other                             7
## 5 White non-Latina                 72

analysis %>%
  group_by(race_ethnicity) %>%
  summarise(
    n = sum(deaths),
    mean_deaths = mean(deaths),
    median_deaths = median(deaths),
    sd_deaths  = sd(deaths),
    min_deaths  = min(deaths),
    max_deaths  = max(deaths)
  )

## # A tibble: 5 × 7
##   race_ethnicity     n mean_deaths median_deaths sd_deaths min_deaths max_deaths
##   <chr>          <dbl>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Asian/Pacific…    42        4.2            4.5     1.81           1          7
## 2 Black non-Lat…   193       16.1           14       6.29           6         26
## 3 Latina           129       10.8            9       5.79           2         21
## 4 Other              7        1.75           1.5     0.957          1          3
## 5 White non-Lat…    72        6              5.5     3.44           1         11

hist(analysis$deaths, 
     main = "Maternal Mortality Rates Distribution", 
     xlab = "Maternal Mortality Rates", 
     col = "darkgreen")

summary(analysis$deaths)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00    8.86   11.75   26.00

ggplot(analysis, aes(x = race_ethnicity, y = deaths, fill = race_ethnicity)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set4") +
  labs(title = "Maternal Mortality Rates by Racial or Ethnic Group", 
       x = "Race/Ethnicity", 
       y = "Maternal Mortality Rates per 100,000 individuals") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Unknown palette: "Set4"

aov_result <- aov(deaths ~ race_ethnicity, data = analysis)
summary(aov_result)

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## race_ethnicity  4 1186.5  296.63   13.82 1.97e-07 ***
## Residuals      45  965.5   21.46                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residuals from the model
residuals1 <- resid(aov_result)

# Histogram to check normality
hist(residuals1, main = "Histogram of Residuals", 
     xlab = "Residuals", col = "pink", border = "black", 
     breaks = 30)

# Normal probability plot to check normality
qqnorm(residuals1)

The p-value is 1.97e-07.

Since the p-value (1.97e-07) is less than α= 0.05, we reject the null hypothesis. There is a statistically significant difference in death means between those who identify as Asian/Pacific Islander, Black non-Latina, Latina, Other, and White non-Latina.

Statistical Analysis on NYC Data (Borough ANOVA)

This analysis aims to examine the relationship between borough and deaths in NYC using data from 2016-2021.

Research Question: Is there a difference in death means for those who live in Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island?

H0: There is no difference in death means for those who live in Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.

HA: There is a difference in death means for at least one of the boroughs (Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.)

analysis3 <- data %>%
  filter(borough != "All")

analysis4 <- analysis3 %>%
  group_by(borough) %>%
  summarise(total_deaths = sum(deaths))

analysis3 %>%
  group_by(borough) %>%
  summarise(
    n = sum(deaths),
    mean_deaths = mean(deaths),
    median_deaths = median(deaths),
    sd_deaths  = sd(deaths),
    min_deaths  = min(deaths),
    max_deaths  = max(deaths)
  )

## # A tibble: 6 × 7
##   borough           n mean_deaths median_deaths sd_deaths min_deaths max_deaths
##   <chr>         <dbl>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Bronx           101        8.42             8      4.36          1         16
## 2 Brooklyn        137       11.4             11      4.32          5         20
## 3 Manhattan        50        4.17             4      2.37          1         10
## 4 Queens           87        7.25             8      3.77          0         13
## 5 Rest of State    40        3.33             3      2.02          0          7
## 6 Staten Island    28        2.55             2      2.21          0          8

ggplot(analysis3, aes(x = borough, y = deaths, fill = borough)) +
  geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set4") +
  labs(title = "Maternal Mortality Rates by NYC Boroughs", 
       x = "Boroughs", 
       y = "Maternal Mortality Rates per 100,000 individuals") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Unknown palette: "Set4"

aov_result1 <- aov(deaths ~ borough, data = analysis3)
summary(aov_result1)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## borough      5  693.8  138.76   12.44 1.77e-08 ***
## Residuals   65  725.1   11.16                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residuals from the model
residuals <- resid(aov_result1)

# Histogram to check normality
hist(residuals, main = "Histogram of Residuals", 
     xlab = "Residuals", col = "coral", border = "black", 
     breaks = 30)

# Normal probability plot to check normality
qqnorm(residuals)

The p-value is 1.77e-08.

Since the p-value (1.77e-08) is less than α= 0.05, we reject the null hypothesis. There is a statistically significant difference in death means for at least one of the boroughs (Bronx, Brooklyn, Manhattan, Queens, Rest of State, and Staten Island.)

Combine NYC and Global Dataset to Compare Rates

data_combine <- data %>%
  select(year, deaths) %>%
   rename(ratio = deaths)
  
data_combine$location <- "NYC"

mm_world_combine <- mm_world_1 %>%
  select(year, continent, ratio)%>%
  rename(location = continent)

stacked_data <- bind_rows(data_combine, mm_world_combine)

datatable(stacked_data)

ggplot(stacked_data, aes(x = year, y = ratio, fill = location)) +
   geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Maternal Mortality Rates by Year and Location",
       x =  "MM Year",
       y = "MM Ratio per 100,000",
       color = "Location") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

Using the NYC Maternal Mortality Dataset and the Global Maternal Mortality Dataset, retrieved using API and as a csv. I was able to use techniques of tidying data to perform visualizations and analysis. I also used gt_summary as a new feature to share with the class. Although this feature took me a little while to figure out and reproduce similar to the original, I found a work around that produced the results I wanted to. My statistical analysis showed that both borough and race/ethnicity has statistically significant difference in means, although my plots and residual plots showed that I did not meet the assumptions of ANOVA - therefore my data is not reliable. My last stacked dataset of NYC MM rates and Global MM rates by continent shows that NYC has very close rates to all of Europe.