flowchart LR
A["GapMinder (1751-2008)"]
B["WHO (2000-2017)"]
A --> C{"OWID (1751-2017)"}
B --> C{"OWID (1751-2017)"}
The definition of maternal mortality is:
“A death of a woman while pregnant or within 42 days of termination of pregnancy, irrespective of the duration and site of pregnancy, from any cause related or aggravated by the pregnancy or its management, but not from accidental or incidental causes.”
The definition of maternal mortality ratio is:
“The number of maternal deaths during a given time period per 100,000 live births during the same time period.”
In this notebook we will create a long-term time-series of maternal mortality ratio. To do this we will combine two existing datasets:
flowchart LR
A["GapMinder (1751-2008)"]
B["WHO (2000-2017)"]
A --> C{"OWID (1751-2017)"}
B --> C{"OWID (1751-2017)"}
We make the assumption that their methods are comparable and that the data can be combined without transformation.
In years where there is an overlap in both time-series we use the data from the WHO.
Firstly we download the data from GapMinder and save it locally. Then we read in the GapMinder data, clean up the column names and take a look at the data.
if (!file.exists('data/input/happiness_report_historical.xls')){
download.file('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls', 'data/input/happiness_report_historical.xls')
}
if (!file.exists('data/input/happiness_report_2022.xls')){
download.file('https://happiness-report.s3.amazonaws.com/2022/Appendix_2_Data_for_Figure_2.1.xls', 'data/input/happiness_report_2022.xls')
}dir.create("data/input", recursive = TRUE, showWarnings = FALSE)
download.file('https://www.gapminder.org/documentation/documentation/gapdata010.xls', paste0('data/input/',date,'_gapminder_data.xls'))
df <- read_xls(paste0('data/input/',date,'_gapminder_data.xls'), col_names = TRUE)
df <- df[-(1:16),]
df <- janitor::clean_names(df)
head(df)There are some issues with the GapMinder data - we will manually clean them here.
The first three rows in the ‘year’ column seem to be wrong as it seems as if each row covers 110 years, but in another source they cover only 10 years.
I will replace them manually here with the middle of each decade, as is used in the other data from the same source.
There are cases where the year for Finland has been entered incorrectly three times, there are two 1772s, 1775s & 1967s, the second of each should be 1872, 1875 and 1957 respectively.
There are also two errors for New Zealand, and one error for both Sweden and the US.
Sri Lanka’s maternal mortality rate seems to drop to zero in 1990, here we set it to NA instead.
df$year[1:3] <- c(1875, 1885,1895)
df['year'][df$year == '1772' & df$country == 'Finland' & df$maternal_deaths == 487,] <- '1872'
df['year'][df$year == '1775' & df$country == 'Finland' & df$maternal_deaths == 629,] <- '1875'
df['year'][df$year == '1967' & df$country == 'Finland' & df$maternal_deaths == 77,] <- '1957'
df['year'][df$year == '1967' & df$country == 'Sweden' & df$maternal_deaths == 39,] <- '1957'
df['year'][df$year == '1967' & df$country == 'United States' & df$maternal_deaths == 1766.28,] <- '1957'
df['year'][df$year =='1989-02'& df$country == 'New Zealand',] <- '1889-02'
df <- df[-which(df$year == '1950' & df$country == 'New Zealand' & df$mmr == 90),]
df['mmr'][df$country == "Sri Lanka" & df$year == 1990,] <- NAFor some rows a range of years is given for a particular maternal mortality rate. We want to find the mid-point of that range. We use this function to find the mid-value of years given.
mid_year <- function(year) {
if(grepl("-", year)){
year_split <- unlist(strsplit(year, split = "-"))
year_begin <- year_split[1]
year_end_len <- nchar(year_split[2])
year_end_pref <- 4 - year_end_len
if (year_end_pref > 0) {
year_pref <- substr(year_begin, 1, year_end_pref)
year_end <- as.numeric(paste(c(year_pref, year_split[2]),collapse = ""))
year_out <- round(mean(c(as.numeric(year_begin), year_end)))
}else{
year_out <- round(mean(as.numeric(c(year_begin,year_split[2]))))
}
}else{
year_out <- year
}
return(year_out)
}
year_out_all <- unlist(lapply(df$year, mid_year))
df$mid_year <- as.numeric(year_out_all)Let’s check the mid-year estimates have worked as expected. There are three cases where the mid-year estimate is not the mid-point of the range of years. Let’s fix these manually. We will keep only the columns we want to take forward.
df %>%
select(year, mid_year) %>%
filter(grepl("-",year)) %>%
mutate(year_begin = substr(year, 1, 4)) %>%
filter(year_begin >= mid_year)
df['mid_year'][df$year == "1897-01",] <- 1899
df['mid_year'][df$year == "1889-02",] <- 1895
df['mid_year'][df$year == "1889-03",] <- 1896
df <- df %>%
select(entity = country, year = mid_year, maternal_mortality_rate = mmr)Read in data from WDI - in future iterations we will read this in from ETL so it is auto updated. The WDI sources this variable from the WHO so we will henceforth refer to it as the WHO data.
download.file('https://api.worldbank.org/v2/en/indicator/SH.STA.MMRT?downloadformat=csv', paste0('data/input/',date,'_wdi_maternal_mortality.zip'))
unzip(paste0('data/input/',date,'_wdi_maternal_mortality.zip'), exdir = "data/input")
des_file <- unzip(paste0('data/input/',date,'_wdi_maternal_mortality.zip'), list=TRUE)
file_name <- des_file$Name[startsWith( des_file$Name,"API_SH.STA.MMRT_DS2")]
who <- clean_names(read.csv(paste0('data/input/', file_name), skip = 4))Pivot and clean the WHO data to long format. Check the GapMinder countries are in the WHO dataset
who_clean <- who %>%
select(country_name, starts_with("X")) %>%
select(-x) %>%
pivot_longer(starts_with("x")) %>%
mutate(year = as.numeric(gsub("x", "", name))) %>%
select(entity = country_name, year, maternal_mortality_rate = value)
head(who_clean)
all(unique(df$entity) %in% who_clean$entity)Combine the two datasets and add a source column so we can keep track of where each datapoint comes from.
df$source <- 'GapMinder'
who_clean$source <- 'WHO_GHO'
df_all <- rbind(df, who_clean)
df_all$maternal_mortality_rate <- as.numeric(df_all$maternal_mortality_rate)
head(df_all)Dealing with duplicate years - there are two identical values for NZ in 1950, we can drop one with distinct(). We remove NAs where there is a non-NA value in either Gapminder or WHO. If there are values for both GapMinder and the WHO data we have a preference for the WHO data.
df_check <-df_all %>%
group_by(entity, year) %>%
add_count() %>%
ungroup()
df_sel <- df_check[-which(df_check$n >1 & is.na(df_check$maternal_mortality_rate)),]
df_sel <-df_sel %>%
group_by(entity, year) %>%
add_count() %>%
ungroup() %>%
distinct()
df_sel <- df_sel[-which(df_sel$nn >1 & df_sel$source == 'GapMinder'),]
df_check <- df_sel %>%
group_by(entity, year) %>%
add_count()
sum(df_check$nnn > 1) == 0
df_sel <- df_sel %>% select(-c(n, nn))Standardising the country names for use on OWID.
stand_countries <- read.csv("data/input/countries_to_standardise_country_standardized.csv")
df_fin <- df_sel %>% left_join(.,stand_countries, by = c("entity" = "Country")) %>%
mutate(entity = Our.World.In.Data.Name) %>%
select(-c(Our.World.In.Data.Name, source))
df_fin$maternal_mortality_rate[is.na(df_fin$maternal_mortality_rate)] <- ""We don’t want to include all of the available regions in the data as it is difficult for us to define them clearly, so we drop out a selection here. We also plot the data for Sweden to check the time-series looks as expected.
entities_to_drop <- c("Africa Eastern and Southern", "Africa Western and Central", "Arab World", "Early-demographic dividend", "East Asia & Pacific (excluding high income)", "East Asia & Pacific (IDA & IBRD countries)", "Euro area", "Europe & Central Asia (excluding high income)", "Europe & Central Asia (IDA & IBRD countries)", "Fragile and conflict affected situations", "Heavily indebted poor countries (HIPC)", "IBRD only", "IDA & IBRD total", "IDA blend", "IDA only", "IDA total", "Late-demographic dividend", "Latin America & Caribbean (excluding high income)", "Latin America & the Caribbean (IDA & IBRD countries)", "Least developed countries: UN classification", "Middle East & North Africa (excluding high income)",
"Middle East & North Africa (IDA & IBRD countries)", "Not classified", "Other small states", "Pacific island small states", "Post-demographic dividend", "Pre-demographic dividend", "Small states", "South Asia (IDA & IBRD)", "Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)")
df_fin <- df_fin %>%
filter(!entity %in% entities_to_drop)
fig <- plot_ly(data = df_fin %>% filter(entity== "Sweden"), x = ~year,
text = ~paste("MMR: ", round(as.numeric(maternal_mortality_rate),2), '<br>Year:', year)) %>% add_trace(y = ~as.numeric(maternal_mortality_rate), name = 'MMR',mode = 'lines') %>% layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'Maternal Mortality Ratio (per 100,000 live births)'))
figWe can also calculate the proportional change since 2000. This is useful for understanding the change in the data over time and we can use this data in a slope chart.
df_clean <- df_fin %>%
filter(year %in% c(2000, 2017)) %>%
filter(complete.cases(.)) %>%
mutate(maternal_mortality_rate = as.numeric(maternal_mortality_rate)) %>%
group_by(entity) %>%
add_count() %>%
filter(n == 2)
df_wide <- df_clean %>%
pivot_wider(., names_from = 'year',names_glue = "{.value}_{year}", values_from = 'maternal_mortality_rate') %>%
mutate(change = ((maternal_mortality_rate_2017 - maternal_mortality_rate_2000)/maternal_mortality_rate_2000)*100, baseline = 0) %>%
select(entity, change, baseline) %>%
pivot_longer(cols = c("change", "baseline")) %>%
mutate(name = gsub("change", 2017, name), name = gsub("baseline", 2000, name)) %>%
mutate(year = as.numeric(name)) %>%
select(entity, year, relative_change_mmr = value)
slope_ggplot <- df_wide %>%
highlight_key(., ~entity) %>%
ggplot( aes(x = year, y = round(relative_change_mmr,2), group = entity, text = entity))+
labs(x = "Year", y = "Relative change since 2000 (%)", title = "Relative change in maternal mortality ratio since 2000 (%)")+
geom_line(color = "#888888" )+
theme_bw()+
theme(legend.position="none",
panel.background = element_rect(fill = "#ffffff"),
axis.ticks = element_blank(),
axis.text = element_text(color = "#888888"))
s <- attrs_selected(
line = list(color = "#0000FF",
width = 5)
)
ggplotly(tooltip = "text", slope_ggplot)%>%
config(displayModeBar = FALSE) %>%
highlight(on = "plotly_hover",
opacityDim = 0.7,
color = "blue",
selected = s) %>%
layout(hoverlabel = list(bgcolor = "blue", font = list(color = "FFFFFF", size = 14)))Lastly, we combine the proportional change data with the full time-series of maternal mortality rate and write out the data to a csv file.
df_out <- df_wide %>% right_join(., df_fin, by = c("entity","year"))
df_out$relative_change_mmr[is.na(df_out$relative_change_mmr)] <- ""
write.csv(df_out, "data/output/maternal_mortality_rate_upload.csv", row.names = FALSE)