knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, error=FALSE)
library(tidyverse)
library(RMySQL) # Older MysQL library (note: I could NOT get the RMariaDB lib to work)
library(formattable) # Pretty print tables
library(xlsx)
library(rvest)
library(stargazer)Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Project2
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:
In the late 1990’s I worked for an NGO that focused on training doctors, nurses and midwives on women’s reproductive health issues and health during pregancy. The NGO developed and tested training programs in very rural areas around the world to teach health care providers how to provide care for women with no easy access to primary health care facilities or doctors.
In our Weekly Discussion forum, Euclid Zhang shared a link to UNICEF data and while browsing their datasets, I stumbled on a few datasets which really spoke to me and the incredible work I once contributed toward. The datasets provide data and trends in both maternal health and outcomes along with indicators by country. For Project 2, rather than pick 3 disparate datasets to analyze, I’m going to instead follow this theme of maternal outcome around the world.
For an understanding of the problem space, see Maternal mortality.
Some key facts they provide: - Every day in 2017, approximately 810 women died from preventable causes related to pregnancy and childbirth. - Between 2000 and 2017, the maternal mortality ratio (MMR, number of maternal deaths per 100,000 live births) dropped by about 38% worldwide. - 94% of all maternal deaths occur in low and lower middle-income countries. - Young adolescents (ages 10-14) face a higher risk of complications and death as a result of pregnancy than other women. - Skilled care before, during and after childbirth can save the lives of women and newborns.
The primary datasets I’ll explore are:
Each of these datasets required Tidy processing to clean up. I also join the data where possible by country and year to tease out additional patterns not available within any one dataset.
To facilitate joining tables using different Country label conventions, I scraped a lookup table with Country names and ISO3 codes which I join to my various datasets.
These datasets explore different years, but have broad overlap between 2000-2015 (every 5 years). My analysis will be somewhat limited by the available data. If I were doing a deeper dive, I would look for more complete data and/or limit my questions to those countries that have a richer set of data. My guess is that better data may indicate more attention to tackling problems with maternal mortality and as a consequence, better data might come from countries with better reporting and more on-the-ground efforts.
During research I found that different datasets included ISO2 (Alpha 2), ISO3 (Alpha 3) and/or descriptive country names. Just in case it’s needed, I wanted a lookup table I can use to join different datasets using different country name conventions.
url <- "https://www.iban.com/country-codes"
cache_fn <- 'country_codes.csv'
isCacheFound <- FALSE
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="myTable"]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
iso3_codes <- data_df
isTidy_iso3_codes <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes. Raw data, first 10 rows shown. (Source: http://iban.com/country-codes)")| Country | Alpha-2 code | Alpha-3 code | Numeric |
|---|---|---|---|
| Afghanistan | AF | AFG | 4 |
| Åland Islands | AX | ALA | 248 |
| Albania | AL | ALB | 8 |
| Algeria | DZ | DZA | 12 |
| American Samoa | AS | ASM | 16 |
| Andorra | AD | AND | 20 |
| Angola | AO | AGO | 24 |
| Anguilla | AI | AIA | 660 |
| Antarctica | AQ | ATA | 10 |
| Antigua and Barbuda | AG | ATG | 28 |
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_iso3_codes == FALSE) {
# Convert year columns into a variable
iso3_codes <- iso3_codes %>%
mutate(ISO3 = `Alpha-3 code`, ISO2 = `Alpha-2 code`) %>%
select(Country, ISO3, ISO2)
isTidy_iso3_codes <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes. Tidy data, first 10 rows shown.")| Country | ISO3 | ISO2 |
|---|---|---|
| Afghanistan | AFG | AF |
| Åland Islands | ALA | AX |
| Albania | ALB | AL |
| Algeria | DZA | DZ |
| American Samoa | ASM | AS |
| Andorra | AND | AD |
| Angola | AGO | AO |
| Anguilla | AIA | AI |
| Antarctica | ATA | AQ |
| Antigua and Barbuda | ATG | AG |
Changes in country population might change how we interpret maternal mortality rate. While the rate should theoretically be independent from population (by definition MMR is maternal deaths per 100,000 live births), if there is a large increase or decrease in population, this might impact conditions around pregnency (nutrition, access to health care, etc) and/or conditions at time of birth (e.g. access to health care). Since MMR is “deaths per live briths”, I’ll end up needing population data to help calculate the number of live births.
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population"
cache_fn <- 'population.csv'
isCacheFound <- FALSE
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
population_df <- data_df
isTidy_population <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data. Raw data, first 10 rows shown. (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)")| Country (or dependent territory) | 1985 | % | 1990 | %_1 | 1995 | %_2 | 2000 | %_3 | 2005 | %_4 | 2010 | %_5 | 2015 | %_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 13120 | -2.70 | 13569 | 0.67 | 19446 | 7.46 | 22462 | 2.93 | 26335 | 3.23 | 29121 | 2.03 | 32565 | 2.26 |
| Albania | 2957 | 2.05 | 3245 | 1.88 | 3159 | -0.54 | 3159 | 0.00 | 3025 | -0.86 | 2987 | -0.25 | 3030 | 0.28 |
| Algeria | 22009 | 3.19 | 25191 | 2.74 | 28322 | 2.37 | 30639 | 1.58 | 32918 | 1.45 | 35950 | 1.78 | 39543 | 1.92 |
| American Samoa | 39 | 3.57 | 48 | 4.09 | 54 | 2.69 | 58 | 1.39 | 57 | -0.28 | 56 | -0.53 | 55 | -0.41 |
| Andorra | 45 | 5.84 | 53 | 3.41 | 64 | 3.70 | 66 | 0.58 | 77 | 3.18 | 85 | 2.12 | 86 | 0.25 |
| Angola | 8390 | 3.09 | 9486 | 2.48 | 11000 | 3.01 | 12683 | 2.89 | 14770 | 3.09 | 17043 | 2.90 | 19626 | 2.86 |
| Anguilla | 7 | 1.35 | 9 | 3.84 | 10 | 3.20 | 12 | 3.02 | 14 | 2.80 | 15 | 2.40 | 17 | 2.14 |
| Antigua and Barbuda | 65 | -1.24 | 65 | -0.07 | 69 | 1.38 | 76 | 1.86 | 82 | 1.53 | 87 | 1.31 | 93 | 1.28 |
| Argentina | 30672 | 1.57 | 33036 | 1.50 | 35274 | 1.32 | 37336 | 1.14 | 39182 | 0.97 | 41344 | 1.08 | 43432 | 0.99 |
| Armenia | 3465 | 2.02 | 3530 | 0.37 | 3131 | -2.37 | 3101 | -0.19 | 3085 | -0.10 | 3072 | -0.08 | 3057 | -0.10 |
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_population == FALSE) {
# Convert year columns into a variable
population_df <- population_df %>%
mutate(Country = `Country (or dependent territory)`) %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, Population, -Country)
# Fix datatypes
population_df$Year <- as.numeric(population_df$Year)
# Filter years
population_df <- population_df %>% filter(Year >= 2000 & Year <= 2015) %>% arrange(Country, Year)
population_df$Population <- population_df$Population* 1000
population_df <- population_df %>% left_join(iso3_codes)
isTidy_population <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data. Tidy date, first 10 rows shown.")| Country | Year | Population | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 22462000 | AFG | AF |
| Afghanistan | 2005 | 26335000 | AFG | AF |
| Afghanistan | 2010 | 29121000 | AFG | AF |
| Afghanistan | 2015 | 32565000 | AFG | AF |
| Albania | 2000 | 3159000 | ALB | AL |
| Albania | 2005 | 3025000 | ALB | AL |
| Albania | 2010 | 2987000 | ALB | AL |
| Albania | 2015 | 3030000 | ALB | AL |
| Algeria | 2000 | 30639000 | DZA | DZ |
| Algeria | 2005 | 32918000 | DZA | DZ |
While there is variation between countries, the overall birthrate worldwide has declined. To understand changes in maternal mortality, we will need to estimate how many births are occuring during years of interest in each country. It’s possible that changes in birth rates could have an impact on MMR.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=excel'
raw_fn <- 'API_SP.DYN.CBRT.IN_DS2_en_excel_v2_248743.xlsx'
cache_fn <- 'births.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 3 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country', 'ISO3', 'Indicator', 'Indicator_Code', as.character(seq(1960, 2017, 1)))
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)## [1] "Found cached copy: data/cache/births.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
birthrate_df <- data_df
isTidy_birthrate <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data. Raw data, first 10 rows shown. Births per 1000 people. (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)")| Country | ISO3 | Indicator | Indicator_Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | ABW | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 35.67900 | 34.52900 | 33.3200 | 32.05000 | 30.73700 | 29.4130 | 28.12100 | 26.90800 | 25.81700 | 24.87200 | 24.09900 | 23.50500 | 23.06800 | 22.76000 | 22.56100 | 22.45200 | 22.41400 | 22.4240 | 22.45400 | 22.47800 | 22.47200 | 22.42400 | 22.32900 | 22.18700 | 21.98900 | 21.72600 | 21.39700 | 21.00800 | 20.5700 | 20.08900 | 19.57100 | 19.02100 | 18.44600 | 17.85900 | 17.27000 | 16.69100 | 16.13200 | 15.59800 | 15.09000 | 14.61500 | 14.17300 | 13.76200 | 13.37500 | 13.01000 | 12.66700 | 12.34800 | 12.05300 | 11.78800 | 11.556 | 11.361 | 11.21400 | 11.12300 | 11.0900 | 11.11100 | 11.17900 | 11.28100 | 11.4040 | 11.53200 |
| Afghanistan | AFG | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 51.27900 | 51.37300 | 51.4570 | 51.53000 | 51.58900 | 51.6310 | 51.65200 | 51.65000 | 51.62300 | 51.57400 | 51.50200 | 51.41100 | 51.30300 | 51.18400 | 51.05800 | 50.93000 | 50.80300 | 50.6780 | 50.55500 | 50.43600 | 50.32100 | 50.21000 | 50.09800 | 49.98400 | 49.86500 | 49.73500 | 49.58600 | 49.41800 | 49.2360 | 49.04800 | 48.88000 | 48.76300 | 48.70900 | 48.71700 | 48.77000 | 48.83500 | 48.87000 | 48.83300 | 48.68800 | 48.41900 | 48.02100 | 47.50500 | 46.90100 | 46.23100 | 45.50700 | 44.72300 | 43.87000 | 42.94400 | 41.949 | 40.903 | 39.82900 | 38.75000 | 37.6900 | 36.67000 | 35.70600 | 34.80900 | 33.9810 | 33.21100 |
| Angola | AGO | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 49.08000 | 48.77900 | 48.5470 | 48.43000 | 48.45000 | 48.6220 | 48.93600 | 49.34300 | 49.78700 | 50.23100 | 50.61900 | 50.90300 | 51.06200 | 51.09400 | 51.00500 | 50.82500 | 50.60000 | 50.3860 | 50.22600 | 50.13900 | 50.13400 | 50.20700 | 50.32200 | 50.44900 | 50.56900 | 50.66300 | 50.71200 | 50.71100 | 50.6570 | 50.54700 | 50.38300 | 50.16800 | 49.91900 | 49.65200 | 49.37800 | 49.11300 | 48.87000 | 48.65200 | 48.46000 | 48.29300 | 48.15000 | 48.02700 | 47.91100 | 47.78600 | 47.63900 | 47.45300 | 47.21500 | 46.92000 | 46.563 | 46.143 | 45.65600 | 45.10200 | 44.4930 | 43.84700 | 43.18200 | 42.52000 | 41.8820 | 41.28100 |
| Albania | ALB | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 40.92400 | 40.36800 | 39.6270 | 38.72300 | 37.69500 | 36.5990 | 35.49600 | 34.43500 | 33.45800 | 32.59000 | 31.83700 | 31.18300 | 30.58700 | 30.01900 | 29.47300 | 28.94900 | 28.45500 | 28.0040 | 27.60600 | 27.26200 | 26.98100 | 26.77200 | 26.62700 | 26.52800 | 26.45200 | 26.36700 | 26.24100 | 26.04700 | 25.7620 | 25.37200 | 24.86700 | 24.24500 | 23.52900 | 22.74200 | 21.90200 | 21.02000 | 20.10600 | 19.17300 | 18.23800 | 17.32100 | 16.43600 | 15.59000 | 14.79000 | 14.04800 | 13.38100 | 12.82100 | 12.39800 | 12.11800 | 11.973 | 11.945 | 12.00100 | 12.10000 | 12.1970 | 12.25700 | 12.25900 | 12.19700 | 12.0800 | 11.93400 |
| Andorra | AND | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 11.90000 | 11.00000 | 11.6000 | 12.50000 | 11.90000 | 11.90000 | 12.10000 | 11.40000 | 10.90000 | 11.00000 | 10.90000 | 11.20000 | 11.90000 | 12.60000 | 11.30000 | 11.80000 | 11.20000 | 10.30000 | 10.90000 | 10.70000 | 10.60000 | 10.10000 | 10.400 | 9.900 | 9.80000 | NA | 9.5000 | NA | NA | NA | 8.8000 | NA |
| Arab World | ARB | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 47.79008 | 47.55839 | 47.3276 | 47.09162 | 46.84421 | 46.5771 | 46.28291 | 45.96055 | 45.61137 | 45.23716 | 44.84362 | 44.44035 | 44.03865 | 43.64783 | 43.27485 | 42.92493 | 42.60063 | 42.2929 | 41.98992 | 41.68051 | 41.34983 | 40.98353 | 40.57024 | 40.10007 | 39.56784 | 38.96898 | 38.30287 | 37.58055 | 36.8177 | 36.02979 | 35.31807 | 34.52188 | 33.83458 | 33.04784 | 32.28643 | 31.50175 | 30.83459 | 30.21717 | 29.64974 | 29.14368 | 28.70911 | 28.34776 | 28.05892 | 27.83078 | 27.65783 | 27.53771 | 27.47728 | 27.46073 | 27.463 | 27.496 | 27.51344 | 27.48487 | 27.3893 | 27.21144 | 26.94078 | 26.57699 | 26.1348 | 25.64802 |
| United Arab Emirates | ARE | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 47.26600 | 46.70500 | 46.0930 | 45.42600 | 44.69400 | 43.8560 | 42.85700 | 41.67800 | 40.32900 | 38.84300 | 37.27600 | 35.69500 | 34.17900 | 32.80400 | 31.62500 | 30.70300 | 30.08200 | 29.7350 | 29.61100 | 29.66600 | 29.83900 | 30.05900 | 30.25100 | 30.34300 | 30.28200 | 30.02200 | 29.53700 | 28.85900 | 28.0260 | 27.05900 | 25.98400 | 24.82800 | 23.63100 | 22.43700 | 21.28200 | 20.21000 | 19.25300 | 18.41000 | 17.67500 | 17.04300 | 16.49400 | 15.99600 | 15.51400 | 15.02400 | 14.51800 | 14.00100 | 13.49000 | 13.01000 | 12.581 | 12.208 | 11.88800 | 11.61600 | 11.3770 | 11.15900 | 10.95800 | 10.77200 | 10.6030 | 10.45600 |
| Argentina | ARG | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 23.77500 | 23.56900 | 23.3720 | 23.19100 | 23.03300 | 22.9090 | 22.83000 | 22.80400 | 22.83400 | 22.92400 | 23.09000 | 23.35000 | 23.69500 | 24.09800 | 24.52400 | 24.91200 | 25.19100 | 25.3200 | 25.27700 | 25.06600 | 24.71800 | 24.28800 | 23.84600 | 23.45100 | 23.12800 | 22.88900 | 22.72700 | 22.60600 | 22.4930 | 22.37200 | 22.22600 | 22.03600 | 21.80200 | 21.52600 | 21.21400 | 20.88100 | 20.54500 | 20.22900 | 19.94500 | 19.70200 | 19.49900 | 19.32800 | 19.17500 | 19.02600 | 18.87900 | 18.73500 | 18.59600 | 18.46900 | 18.355 | 18.252 | 18.15400 | 18.05500 | 17.9490 | 17.83100 | 17.69800 | 17.54800 | 17.3830 | 17.20500 |
| Armenia | ARM | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | 38.50900 | 37.21700 | 35.5610 | 33.62900 | 31.54000 | 29.4470 | 27.49600 | 25.78600 | 24.38900 | 23.35500 | 22.69400 | 22.36600 | 22.26200 | 22.28300 | 22.38200 | 22.53000 | 22.72100 | 22.9610 | 23.24300 | 23.53800 | 23.82300 | 24.08200 | 24.29700 | 24.44600 | 24.50200 | 24.43100 | 24.20700 | 23.82600 | 23.2920 | 22.60900 | 21.76800 | 20.76400 | 19.62700 | 18.41000 | 17.17200 | 15.99100 | 14.94700 | 14.09200 | 13.45700 | 13.06200 | 12.91000 | 12.98000 | 13.20800 | 13.52400 | 13.88400 | 14.24400 | 14.57200 | 14.85500 | 15.080 | 15.233 | 15.30400 | 15.29700 | 15.2320 | 15.12500 | 14.97900 | 14.79500 | 14.5680 | 14.29800 |
| American Samoa | ASM | Birth rate, crude (per 1,000 people) | SP.DYN.CBRT.IN | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 21.60000 | NA | NA | NA | 18.70000 | NA | NA | NA | 17.50000 | NA | NA | NA |
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_birthrate == FALSE) {
# Convert year columns into a variable
birthrate_df <- birthrate_df %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, BirthRate, -Country) %>%
arrange(Country, Year) %>%
inner_join(iso3_codes)
# Fix datatypes
birthrate_df$Year <- as.numeric(birthrate_df$Year)
isTidy_birthrate <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data. Tidy data, first 10 rows shown. Births per 1000 people.")| Country | Year | BirthRate | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 48.021 | AFG | AF |
| Afghanistan | 2005 | 44.723 | AFG | AF |
| Afghanistan | 2010 | 39.829 | AFG | AF |
| Afghanistan | 2015 | 34.809 | AFG | AF |
| Albania | 2000 | 16.436 | ALB | AL |
| Albania | 2005 | 12.821 | ALB | AL |
| Albania | 2010 | 12.001 | ALB | AL |
| Albania | 2015 | 12.197 | ALB | AL |
| Algeria | 2000 | 19.554 | DZA | DZ |
| Algeria | 2005 | 20.774 | DZA | DZ |
THis is the primary dataset which everything else is linked with. Here we see the maternal mortality rate provided as deaths per 100,000 live births broken out by country and year.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2015/11/MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
raw_fn <- 'MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
cache_fn <- 'mmr_data.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 4 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 17 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','2000','2005','2010','2015', '2017')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)## [1] "Found cached copy: data/cache/mmr_data.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
mmr_df <- data_df
isTidy_mmr <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Raw data, first 10 rows shown. (Source: https://data.unicef.org/resources/dataset/maternal-health-data/)")| ISO3 | Country | 2000 | 2005 | 2010 | 2015 | 2017 |
|---|---|---|---|---|---|---|
| AFG | Afghanistan | 1450 | 1140 | 954 | 701 | 638 |
| ALB | Albania | 23 | 22 | 21 | 15 | 15 |
| DZA | Algeria | 161 | 127 | 115 | 114 | 112 |
| AGO | Angola | 827 | 519 | 326 | 251 | 241 |
| ATG | Antigua and Barbuda | 44 | 40 | 44 | 43 | 42 |
| ARG | Argentina | 66 | 59 | 51 | 41 | 39 |
| ARM | Armenia | 43 | 35 | 32 | 28 | 26 |
| AUS | Australia | 7 | 5 | 5 | 6 | 6 |
| AUT | Austria | 6 | 6 | 5 | 5 | 5 |
| AZE | Azerbaijan | 47 | 42 | 31 | 27 | 26 |
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_mmr == FALSE) {
# Convert year columns into a variable
mmr_df <- mmr_df %>%
gather(Year, MMR, -ISO3, -Country)
# Fix datatypes
mmr_df$Year <- as.numeric(mmr_df$Year)
# Filter years
mmr_df <- mmr_df %>% filter(Year <= 2015) %>% arrange(Country, Year)
isTidy_mmr <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015.")| ISO3 | Country | Year | MMR |
|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 |
| AFG | Afghanistan | 2005 | 1140 |
| AFG | Afghanistan | 2010 | 954 |
| AFG | Afghanistan | 2015 | 701 |
| ALB | Albania | 2000 | 23 |
| ALB | Albania | 2005 | 22 |
| ALB | Albania | 2010 | 21 |
| ALB | Albania | 2015 | 15 |
| DZA | Algeria | 2000 | 161 |
| DZA | Algeria | 2005 | 127 |
UNICEF provides country level data describing a number of different indicators related to pregnancy and births. The downloaded raw dataset includes a number of different indicator tabs, “Skilled Attendant at Birth” (SAB) being only one of many. I’m curious how the presence of a trainined attendant might correlate with the more central question of maternal mortality during birth. UNICEF don’t report for all countries, only a subset, so that will limit my analysis to specific countries.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
cache_fn <- 'sab_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 8 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 9 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 5 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','Year','Total','Age_1517','Age_1819', 'Age_lt20', 'Age_gt20', 'Age_2034', 'Age_3549', 'Source', 'SourceYear')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)## [1] "Found cached copy: data/cache/sab_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
sab_df <- data_df
isTidy_sab <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth. Raw data, first 10 rows shown. (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)")| ISO3 | Country | Year | Total | Age_1517 | Age_1819 | Age_lt20 | Age_gt20 | Age_2034 | Age_3549 | Source | SourceYear |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | Afghanistan | 2010 | 39.00000 | 34 | 35 | 34 | 37 | 37 | 37 | MICS | 2010 |
| ALB | Albania | 2000 | 99.12912 | 100 | 100 | 100 | 99.0734476224367 | 98.9604208782254 | 100 | MICS | 2000 |
| ALB | Albania | 2005 | 99.75600 | 100 | 100 | 100 | 99.7308564178849 | 99.7006924128754 | 100 | MICS | 2005 |
| ALB | Albania | 2008 | 99.43894 | 93.9805367626799 | 100 | 98.198223919787 | 99.5253293235263 | 99.5189609784249 | 99.6050081410941 | DHS | 2008 |
| ARM | Armenia | 2000 | 97.17686 | 96.1092640141004 | 97.3167730215472 | 97.112296301636 | 97.1883744086192 | 97.4217674005781 | 93.8763280483474 | DHS | 2000 |
| ARM | Armenia | 2005 | 98.58195 | 100 | 100 | 100 | 98.4219872151632 | 98.3336271503564 | 100 | DHS | 2005 |
| ARM | Armenia | 2010 | 100.00000 | 100 | 100 | 100 | 100 | 100 | 100 | DHS | 2010 |
| AZE | Azerbaijan | 2000 | 87.46867 | 100 | 91.6666666666667 | 94 | 86.5497076023392 | 87.3817034700316 | 76 | MICS | 2000 |
| AZE | Azerbaijan | 2006 | 89.58261 | 96.3219109387315 | 96.007394399452 | 96.0907370879405 | 88.9887563923704 | 89.1024951119957 | 87.6490910432445 | DHS | 2006 |
| BGD | Bangladesh | 2004 | 15.53558 | 14.659160649885 | 17.0460969662523 | 15.7557293822027 | 15.4272774340049 | 15.9158543553316 | 9.65134903482117 | DHS | 2004 |
isTidy_sab <- FALSE
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_sab == FALSE) {
# While this is a wide dataset and it's trivial to gather() the age data, for my questions, I really don't need age. Since my MMR, Population and NHA data do not provide Age breakouts, I will only focus on the Total percentage.
sab_df <- sab_df %>% select(ISO3, Country, Year, Total, Source)
# Fix datatypes
sab_df$Year <- as.numeric(sab_df$Year)
sab_df$Total <- as.numeric(sab_df$Total)
# Since there are two different sources of data, we might have a case where they report for the same year. I'm going to group_by() and average the totals in the case where we have multiple data points.
sab_df <- sab_df %>% group_by(ISO3, Year) %>% summarize(SAB_Pct = mean(Total))
sab_df <- sab_df %>% filter(Year == 2000 | Year == 2005 | Year == 2010 | Year == 2015)
isTidy_sab <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth. Tidy data, first 10 rows shown.")| ISO3 | Year | SAB_Pct |
|---|---|---|
| AFG | 2010 | 39.00000 |
| ALB | 2000 | 99.12912 |
| ALB | 2005 | 99.75600 |
| ARM | 2000 | 97.17686 |
| ARM | 2005 | 98.58195 |
| ARM | 2010 | 100.00000 |
| AZE | 2000 | 87.46867 |
| BDI | 2000 | 25.17007 |
| BDI | 2005 | 33.62995 |
| BDI | 2010 | 67.18792 |
WHO provides on online tool to export data from WHO member countries on a variety of meansures. After exploring options, I honed in on Health Expenditures. I am curious if spending on Health has any noticable correlation with maternal mortality outcomes. I manually downloaded the NHA_indicators.xlsx from their online tool.
reference: http://apps.who.int/nha/database/Home/Index/en source: http://apps.who.int/nha/database/ViewData/Indicators/en
# reference: http://apps.who.int/nha/database/Home/Index/en
# source: http://apps.who.int/nha/database/ViewData/Indicators/en
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
#source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'NHA_indicators.xlsx'
cache_fn <- 'nha_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 2 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country','Indicator','Note','Year_2000','Year_2005','Year_2010', 'Year_2015')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste('data/cache/', cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Since data came from an online tool, we don't have a specific file we can download
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
print(msg)## [1] "Found cached copy: data/cache/nha_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
nha_df <- data_df
isTidy_nha <- FALSE
print(msg)## [1] "Cached CSV data loaded."
knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Health Expenditures Data. Raw data, first 10 rows shown. (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)")| Country | Indicator | Note | Year_2000 | Year_2005 | Year_2010 | Year2015 |
|---|---|---|---|---|---|---|
| Algeria | Current Health Expenditure (CHE) per Capita in PPP | NA | 2.823850e+02 | 354.95766 | 645.28783 | 1020.54641 |
| Algeria | Gross Domestic Product | in million current PPP | 2.523851e+05 | 365236.38917 | 455452.36606 | 582720.55376 |
| Algeria | Purchasing Power Parity (NCU per Int$) | NA | 1.633813e+01 | 20.70440 | 26.32891 | 28.66228 |
| Algeria | Population (in thousands) | NA | 3.118366e+04 | 33288.43700 | 36117.63700 | 39871.52800 |
| Algeria | Female Population 15-49 | NA | 1.206122e+04 | 12739.57200 | 13204.20100 | 14043.06800 |
| Angola | Current Health Expenditure (CHE) per Capita in PPP | NA | 6.492660e+01 | 131.33186 | 158.17301 | 195.75735 |
| Angola | Gross Domestic Product | in million current PPP | 4.572640e+04 | 68754.61066 | 135067.31639 | 184842.95978 |
| Angola | Purchasing Power Parity (NCU per Int$) | NA | 2.004662e+00 | 35.79142 | 56.15439 | 66.65537 |
| Angola | Population (in thousands) | NA | 1.644092e+04 | 19552.54200 | 23369.13100 | 27859.30500 |
| Angola | Female Population 15-49 | NA | 6.121979e+03 | 7243.34500 | 8637.37000 | 10368.29500 |
# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_nha == FALSE) {
# Remove to extra Note column we won't need
nha_df <- nha_df %>% select(-Note)
# Convert year columns into a variable
nha_df <- nha_df %>%
gather(Year, Amount, -Country, -Indicator) %>%
mutate(Year = stringr::str_replace(Year, "Year_", "")) %>%
drop_na()
# Fix datatypes
nha_df$Year <- as.numeric(nha_df$Year)
nha_df$Amount <- as.numeric(nha_df$Amount)
# Break out CHE into a separate column
exp_data <- nha_df %>% filter(Indicator == 'Current Health Expenditure (CHE) per Capita in PPP') %>%
mutate(CHE = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Break out GDP into a separate column
gdp_data <- nha_df %>% filter(Indicator == 'Gross Domestic Product') %>%
mutate(GDP = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Join the columns back into a table based on Country & Year
nha_df <- gdp_data %>%
left_join(exp_data) %>%
left_join(iso3_codes)
isTidy_nha <- TRUE # if this cell ir rerun, don't try to re-tidy
} else {
print('Already processed, nothing to do.')
}knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Reported Health Expernditures & GDP by Country & Year. Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country. This is also sometimes refered to as the 'Big Mac Index'")| Country | Year | GDP | CHE | ISO3 | ISO2 |
|---|---|---|---|---|---|
| Afghanistan | 2005 | 24851.345 | 98.61208 | AFG | AF |
| Afghanistan | 2010 | 44457.152 | 132.27128 | AFG | AF |
| Albania | 2000 | 11925.958 | 258.28327 | ALB | AL |
| Albania | 2005 | 17663.350 | 363.84904 | ALB | AL |
| Albania | 2010 | 28073.799 | 478.31474 | ALB | AL |
| Algeria | 2000 | 252385.066 | 282.38500 | DZA | DZ |
| Algeria | 2005 | 365236.389 | 354.95766 | DZA | DZ |
| Algeria | 2010 | 455452.366 | 645.28783 | DZA | DZ |
| Andorra | 2000 | 2132.781 | 3049.00618 | AND | AD |
| Andorra | 2005 | 3444.711 | 3741.40393 | AND | AD |
Note: PPP refers to “Price Point Parity”. This country-by-country adjustment reflects different purchasing power in different geos and attempts to normalize spending for better comparison across geos. We know intuititely that cost of living and items vary greatly around the world and $1 in the US doesn’t NOT have the same purchasing power as $1 in India. PPP normalizes so that spending can be compared across countries.
Note that we are combining data from a number of sources and years. As such there wil be a number of NA’s where data is missing in various columns. Here I also calculate the actual number of maternal deaths based on populations, birth rates and MMR rates.
# Use the current population by year * birthrate by year to get actual births
births_df <- birthrate_df %>%
inner_join(population_df) %>%
mutate(Births = round(Population / 1000 * BirthRate)) %>%
select(ISO3, Country, Year, Population, BirthRate, Births)
births_df <- births_df %>% left_join(iso3_codes)
# now that we have births, we can multiply by MMR (deaths per 100,000 births)
mmr_df <- mmr_df %>%
left_join(births_df) %>%
mutate(Deaths = round(Births / 10000 * MMR))
# join in Skilled Attendant at Birth (SAB)
mmr_df <- mmr_df %>%
left_join(sab_df) %>%
left_join(nha_df)knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortaility table with supporting columns.")| ISO3 | Country | Year | MMR | Population | BirthRate | Births | ISO2 | Deaths | SAB_Pct | GDP | CHE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 | 22462000 | 48.021 | 1078648 | AF | 156404 | NA | NA | NA |
| AFG | Afghanistan | 2005 | 1140 | 26335000 | 44.723 | 1177780 | AF | 134267 | NA | 24851.34 | 98.61208 |
| AFG | Afghanistan | 2010 | 954 | 29121000 | 39.829 | 1159860 | AF | 110651 | 39.00000 | 44457.15 | 132.27128 |
| AFG | Afghanistan | 2015 | 701 | 32565000 | 34.809 | 1133555 | AF | 79462 | NA | NA | NA |
| ALB | Albania | 2000 | 23 | 3159000 | 16.436 | 51921 | AL | 119 | 99.12912 | 11925.96 | 258.28327 |
| ALB | Albania | 2005 | 22 | 3025000 | 12.821 | 38784 | AL | 85 | 99.75600 | 17663.35 | 363.84904 |
| ALB | Albania | 2010 | 21 | 2987000 | 12.001 | 35847 | AL | 75 | NA | 28073.80 | 478.31474 |
| ALB | Albania | 2015 | 15 | 3030000 | 12.197 | 36957 | AL | 55 | NA | NA | NA |
| DZA | Algeria | 2000 | 161 | 30639000 | 19.554 | 599115 | DZ | 9646 | NA | 252385.07 | 282.38500 |
| DZA | Algeria | 2005 | 127 | 32918000 | 20.774 | 683839 | DZ | 8685 | NA | 365236.39 | 354.95766 |
We generally see a decline in mortality rates from 2000 to 2015 data.
data <- mmr_df %>% filter(Year==2000 | Year==2015) %>% select(MMR, Year)
data$Year <- as.factor(data$Year)
ggplot(data=data, aes(x=MMR, color=Year)) +
geom_histogram(fill='white', alpha=0.2, position="identity") +
scale_color_discrete(name="Year", labels=c(2000, 2015))Let’s see which countries have the most reduction in maternal deaths.
# Group data by country and calculate the delta in deaths over the years
tbl <- mmr_df %>%
group_by(Country) %>%
summarize(delta = round(first(Deaths) - last(Deaths))) %>%
arrange(desc(delta)) %>%
drop_na()
# barplot
ggplot(tbl %>% top_n(10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Top 10 Countries with Reductions in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = -0.5, vjust = 0.5 ) +
scale_y_discrete(labels = comma, breaks=c(1000, 100000, 200000, 600000)) +
ylim(c(0,700000)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip()… and those with increases in maternal deaths. Nigeria is clearly experiencing a crisis.
# barplot
ggplot(tbl %>% top_n(-10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Bottom 10 Countries with Increases in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.5, vjust = 0.5 ) +
scale_y_discrete(labels = comma, breaks=c(-30000, -20000, -1000)) +
ylim(c(-25000,0)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip()We see that generally, countries spend less than $4000 (PPP) adjusted dollars per person on health expenditures and as GDP increases, there is an increase in CHE. We some clear outliers with very high GDP countries spending very little and vice versa, countries with relatively low GDP spending reporting quite high CHE. While this is an interesting pattern, I suspect that CHE is not entriely reported nor being spent on the poorest population that might be more impacted by MMR. Unfortunaely, I don’t have the dataset to fully explore those questions.
ggplot(mmr_df, aes(GDP, CHE)) +
geom_point() +
labs(title='Country Health Expenditure (PPP) vs Country GDP (PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('CHE (Dollars, PPP)')We do see a pattern where higher reported health expenditures correlate with lower maternal mortality and inversely, those countries (and years) with lower CHE see higher MMR.
ggplot(mmr_df, aes(CHE, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country Health Expenditure (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('MMR (per 100k live births)')The death count increases most dramatically with lower health expenditures and population impacts.
ggplot(mmr_df, aes(CHE, Deaths)) +
geom_point() +
labs(title='Maternal Deaths vs Country Health Expenditures (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('Maternal Deaths')Generally the Maternal Mortality rate is higher with counties having lower GDP.
ggplot(mmr_df, aes(GDP, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country GDP (Millions, PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('MMR (per 100k live births)')While we don’t have many datapoints, there does appear to be a trend with lowered materal mortality when there is a higher percentage of births managed by a health professional. That said, a Skilled Attendant at birth may only be able to do so much … there may be affects earlier in pregnancy leading to mortality.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate vs Skilled Attendants at Birth (SAB)',
caption='note, we only have 98 datapoints with both SAB and MMR')+
xlab('Skilled Attendant at Birth (%)') +
ylab('MMR (per 100k live births)')As with mortality rate, we generally see fewer reported deaths as the percent of births are handled by a trained health professional.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, Deaths)) +
geom_point() +
scale_y_continuous(labels = comma) +
ylim(0, 300000) +
labs(title='Maternal Deaths vs Skilled Attendants at Birth (SAB)') +
xlab('Skilled Attendant at Birth (%)') +
ylab('Maternal Deaths')I started this project with a single core dataset with MMR by country and year. As I tried to answer questions, I found I needed additional data which lead to chasing down populations, birthrates and country codes. Only then did I seek out the Skilled Attendants and Health Spending data to look for patterns. Worldwide, the reported maternal mortality rate has dropped by ~40% which broadly shows up in my histograms. While the overall mortality rate is decreasing, some countries are greatly improving (e.g. India) while others have more deaths (e.g. Nigeria). Looking at health spend, GDP and presence of Skilled Attendants during birth, the later had the most notable correlation, but data was scarce. The NGO I worked with was targeting training of health workers, and from this quick survey of data, that is likely a fruitful approach to the problem.