DATA 607 Project 2

knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, error=FALSE)
library(tidyverse)
library(RMySQL)      # Older MysQL library (note: I could NOT get the RMariaDB lib to work)
library(formattable) # Pretty print tables
library(xlsx)
library(rvest)
library(stargazer)

Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Project2

Assignment

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Please include in your homework submission, for each of the three chosen datasets:

The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.

Project

In the late 1990’s I worked for an NGO that focused on training doctors, nurses and midwives on women’s reproductive health issues and health during pregancy. The NGO developed and tested training programs in very rural areas around the world to teach health care providers how to provide care for women with no easy access to primary health care facilities or doctors.

In our Weekly Discussion forum, Euclid Zhang shared a link to UNICEF data and while browsing their datasets, I stumbled on a few datasets which really spoke to me and the incredible work I once contributed toward. The datasets provide data and trends in both maternal health and outcomes along with indicators by country. For Project 2, rather than pick 3 disparate datasets to analyze, I’m going to instead follow this theme of maternal outcome around the world.

For an understanding of the problem space, see Maternal mortality.

Some key facts they provide: - Every day in 2017, approximately 810 women died from preventable causes related to pregnancy and childbirth. - Between 2000 and 2017, the maternal mortality ratio (MMR, number of maternal deaths per 100,000 live births) dropped by about 38% worldwide. - 94% of all maternal deaths occur in low and lower middle-income countries. - Young adolescents (ages 10-14) face a higher risk of complications and death as a result of pregnancy than other women. - Skilled care before, during and after childbirth can save the lives of women and newborns.

The primary datasets I’ll explore are:

Each of these datasets required Tidy processing to clean up. I also join the data where possible by country and year to tease out additional patterns not available within any one dataset.

To facilitate joining tables using different Country label conventions, I scraped a lookup table with Country names and ISO3 codes which I join to my various datasets.

World ISO3 Country Codes

These datasets explore different years, but have broad overlap between 2000-2015 (every 5 years). My analysis will be somewhat limited by the available data. If I were doing a deeper dive, I would look for more complete data and/or limit my questions to those countries that have a richer set of data. My guess is that better data may indicate more attention to tackling problems with maternal mortality and as a consequence, better data might come from countries with better reporting and more on-the-ground efforts.

Load Data

Load Country Codes

During research I found that different datasets included ISO2 (Alpha 2), ISO3 (Alpha 3) and/or descriptive country names. Just in case it’s needed, I wanted a lookup table I can use to join different datasets using different country name conventions.

url             <- "https://www.iban.com/country-codes"
cache_fn        <- 'country_codes.csv'

isCacheFound    <- FALSE
local_cache_fn  <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
}

# Cached ata file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(local_cache_fn, col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else {
  # Load Raw Data
  data_df <- url %>%
    xml2::read_html() %>%
    html_nodes(xpath='//*[@id="myTable"]') %>%
    html_table()
  
  data_df <- data_df[[1]]
  
  # Cache the processed as CSV for future
  write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}

iso3_codes <- data_df
isTidy_iso3_codes <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes.  Raw data, first 10 rows shown.  (Source: http://iban.com/country-codes)")

ISO2 and ISO3 Country Codes. Raw data, first 10 rows shown. (Source: http://iban.com/country-codes)
Country	Alpha-2 code	Alpha-3 code	Numeric
Afghanistan	AF	AFG	4
Åland Islands	AX	ALA	248
Albania	AL	ALB	8
Algeria	DZ	DZA	12
American Samoa	AS	ASM	16
Andorra	AD	AND	20
Angola	AO	AGO	24
Anguilla	AI	AIA	660
Antarctica	AQ	ATA	10
Antigua and Barbuda	AG	ATG	28

Tidy Country Codes

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_iso3_codes == FALSE) {
  
  # Convert year columns into a variable
  iso3_codes <- iso3_codes %>% 
    mutate(ISO3 = `Alpha-3 code`, ISO2 = `Alpha-2 code`) %>%
    select(Country, ISO3, ISO2)
  
  isTidy_iso3_codes <- TRUE # if this cell ir rerun, don't try to re-tidy
  
} else {
  print('Already processed, nothing to do.')
}

knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes.  Tidy data, first 10 rows shown.")

ISO2 and ISO3 Country Codes. Tidy data, first 10 rows shown.
Country	ISO3	ISO2
Afghanistan	AFG	AF
Åland Islands	ALA	AX
Albania	ALB	AL
Algeria	DZA	DZ
American Samoa	ASM	AS
Andorra	AND	AD
Angola	AGO	AO
Anguilla	AIA	AI
Antarctica	ATA	AQ
Antigua and Barbuda	ATG	AG

Load Population Data

Changes in country population might change how we interpret maternal mortality rate. While the rate should theoretically be independent from population (by definition MMR is maternal deaths per 100,000 live births), if there is a large increase or decrease in population, this might impact conditions around pregnency (nutrition, access to health care, etc) and/or conditions at time of birth (e.g. access to health care). Since MMR is “deaths per live briths”, I’ll end up needing population data to help calculate the number of live births.

url        <- "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population"
cache_fn   <- 'population.csv'

isCacheFound <- FALSE
local_cache_fn    <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
}

# Cached ata file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(local_cache_fn, col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else {
  # Load Raw Data
  data_df <- url %>%
    xml2::read_html() %>%
    html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
    html_table()
  
  data_df <- data_df[[1]]
  
  # Cache the processed as CSV for future
  write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}

population_df <- data_df
isTidy_population <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data.  Raw data, first 10 rows shown.  (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)")

Population Data. Raw data, first 10 rows shown. (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)
Country (or dependent territory)	1985	%	1990	%_1	1995	%_2	2000	%_3	2005	%_4	2010	%_5	2015	%_6
Afghanistan	13120	-2.70	13569	0.67	19446	7.46	22462	2.93	26335	3.23	29121	2.03	32565	2.26
Albania	2957	2.05	3245	1.88	3159	-0.54	3159	0.00	3025	-0.86	2987	-0.25	3030	0.28
Algeria	22009	3.19	25191	2.74	28322	2.37	30639	1.58	32918	1.45	35950	1.78	39543	1.92
American Samoa	39	3.57	48	4.09	54	2.69	58	1.39	57	-0.28	56	-0.53	55	-0.41
Andorra	45	5.84	53	3.41	64	3.70	66	0.58	77	3.18	85	2.12	86	0.25
Angola	8390	3.09	9486	2.48	11000	3.01	12683	2.89	14770	3.09	17043	2.90	19626	2.86
Anguilla	7	1.35	9	3.84	10	3.20	12	3.02	14	2.80	15	2.40	17	2.14
Antigua and Barbuda	65	-1.24	65	-0.07	69	1.38	76	1.86	82	1.53	87	1.31	93	1.28
Argentina	30672	1.57	33036	1.50	35274	1.32	37336	1.14	39182	0.97	41344	1.08	43432	0.99
Armenia	3465	2.02	3530	0.37	3131	-2.37	3101	-0.19	3085	-0.10	3072	-0.08	3057	-0.10

Tidy Population Data

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_population == FALSE) {
  
  # Convert year columns into a variable
  population_df <- population_df %>% 
    mutate(Country = `Country (or dependent territory)`) %>%
    select(Country, `2000`, `2005`, `2010`, `2015`) %>%
    gather(Year, Population, -Country)

  # Fix datatypes
  population_df$Year <- as.numeric(population_df$Year)

  # Filter years  
  population_df <- population_df %>% filter(Year >= 2000 & Year <= 2015) %>% arrange(Country, Year)
  population_df$Population <- population_df$Population* 1000
  population_df <- population_df %>% left_join(iso3_codes)
  
  isTidy_population <- TRUE # if this cell ir rerun, don't try to re-tidy
  
} else {
  print('Already processed, nothing to do.')
}

knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data.  Tidy date, first 10 rows shown.")

Population Data. Tidy date, first 10 rows shown.
Country	Year	Population	ISO3	ISO2
Afghanistan	2000	22462000	AFG	AF
Afghanistan	2005	26335000	AFG	AF
Afghanistan	2010	29121000	AFG	AF
Afghanistan	2015	32565000	AFG	AF
Albania	2000	3159000	ALB	AL
Albania	2005	3025000	ALB	AL
Albania	2010	2987000	ALB	AL
Albania	2015	3030000	ALB	AL
Algeria	2000	30639000	DZA	DZ
Algeria	2005	32918000	DZA	DZ

Load Birthrate Data

While there is variation between countries, the overall birthrate worldwide has declined. To understand changes in maternal mortality, we will need to estimate how many births are occuring during years of interest in each country. It’s possible that changes in birth rates could have an impact on MMR.

# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=excel'
raw_fn            <- 'API_SP.DYN.CBRT.IN_DS2_en_excel_v2_248743.xlsx'
cache_fn          <- 'births.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 3  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 0  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('Country', 'ISO3', 'Indicator', 'Indicator_Code', as.character(seq(1960, 2017, 1)))

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn    <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the dataset
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)

## [1] "Found cached copy: data/cache/births.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

birthrate_df   <- data_df
isTidy_birthrate <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data.  Raw data, first 10 rows shown.  Births per 1000 people.  (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)")

Birthrate Data. Raw data, first 10 rows shown. Births per 1000 people. (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)
Country	ISO3	Indicator	Indicator_Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
Aruba	ABW	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	35.67900	34.52900	33.3200	32.05000	30.73700	29.4130	28.12100	26.90800	25.81700	24.87200	24.09900	23.50500	23.06800	22.76000	22.56100	22.45200	22.41400	22.4240	22.45400	22.47800	22.47200	22.42400	22.32900	22.18700	21.98900	21.72600	21.39700	21.00800	20.5700	20.08900	19.57100	19.02100	18.44600	17.85900	17.27000	16.69100	16.13200	15.59800	15.09000	14.61500	14.17300	13.76200	13.37500	13.01000	12.66700	12.34800	12.05300	11.78800	11.556	11.361	11.21400	11.12300	11.0900	11.11100	11.17900	11.28100	11.4040	11.53200
Afghanistan	AFG	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	51.27900	51.37300	51.4570	51.53000	51.58900	51.6310	51.65200	51.65000	51.62300	51.57400	51.50200	51.41100	51.30300	51.18400	51.05800	50.93000	50.80300	50.6780	50.55500	50.43600	50.32100	50.21000	50.09800	49.98400	49.86500	49.73500	49.58600	49.41800	49.2360	49.04800	48.88000	48.76300	48.70900	48.71700	48.77000	48.83500	48.87000	48.83300	48.68800	48.41900	48.02100	47.50500	46.90100	46.23100	45.50700	44.72300	43.87000	42.94400	41.949	40.903	39.82900	38.75000	37.6900	36.67000	35.70600	34.80900	33.9810	33.21100
Angola	AGO	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	49.08000	48.77900	48.5470	48.43000	48.45000	48.6220	48.93600	49.34300	49.78700	50.23100	50.61900	50.90300	51.06200	51.09400	51.00500	50.82500	50.60000	50.3860	50.22600	50.13900	50.13400	50.20700	50.32200	50.44900	50.56900	50.66300	50.71200	50.71100	50.6570	50.54700	50.38300	50.16800	49.91900	49.65200	49.37800	49.11300	48.87000	48.65200	48.46000	48.29300	48.15000	48.02700	47.91100	47.78600	47.63900	47.45300	47.21500	46.92000	46.563	46.143	45.65600	45.10200	44.4930	43.84700	43.18200	42.52000	41.8820	41.28100
Albania	ALB	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	40.92400	40.36800	39.6270	38.72300	37.69500	36.5990	35.49600	34.43500	33.45800	32.59000	31.83700	31.18300	30.58700	30.01900	29.47300	28.94900	28.45500	28.0040	27.60600	27.26200	26.98100	26.77200	26.62700	26.52800	26.45200	26.36700	26.24100	26.04700	25.7620	25.37200	24.86700	24.24500	23.52900	22.74200	21.90200	21.02000	20.10600	19.17300	18.23800	17.32100	16.43600	15.59000	14.79000	14.04800	13.38100	12.82100	12.39800	12.11800	11.973	11.945	12.00100	12.10000	12.1970	12.25700	12.25900	12.19700	12.0800	11.93400
Andorra	AND	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	11.90000	11.00000	11.6000	12.50000	11.90000	11.90000	12.10000	11.40000	10.90000	11.00000	10.90000	11.20000	11.90000	12.60000	11.30000	11.80000	11.20000	10.30000	10.90000	10.70000	10.60000	10.10000	10.400	9.900	9.80000	NA	9.5000	NA	NA	NA	8.8000	NA
Arab World	ARB	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	47.79008	47.55839	47.3276	47.09162	46.84421	46.5771	46.28291	45.96055	45.61137	45.23716	44.84362	44.44035	44.03865	43.64783	43.27485	42.92493	42.60063	42.2929	41.98992	41.68051	41.34983	40.98353	40.57024	40.10007	39.56784	38.96898	38.30287	37.58055	36.8177	36.02979	35.31807	34.52188	33.83458	33.04784	32.28643	31.50175	30.83459	30.21717	29.64974	29.14368	28.70911	28.34776	28.05892	27.83078	27.65783	27.53771	27.47728	27.46073	27.463	27.496	27.51344	27.48487	27.3893	27.21144	26.94078	26.57699	26.1348	25.64802
United Arab Emirates	ARE	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	47.26600	46.70500	46.0930	45.42600	44.69400	43.8560	42.85700	41.67800	40.32900	38.84300	37.27600	35.69500	34.17900	32.80400	31.62500	30.70300	30.08200	29.7350	29.61100	29.66600	29.83900	30.05900	30.25100	30.34300	30.28200	30.02200	29.53700	28.85900	28.0260	27.05900	25.98400	24.82800	23.63100	22.43700	21.28200	20.21000	19.25300	18.41000	17.67500	17.04300	16.49400	15.99600	15.51400	15.02400	14.51800	14.00100	13.49000	13.01000	12.581	12.208	11.88800	11.61600	11.3770	11.15900	10.95800	10.77200	10.6030	10.45600
Argentina	ARG	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	23.77500	23.56900	23.3720	23.19100	23.03300	22.9090	22.83000	22.80400	22.83400	22.92400	23.09000	23.35000	23.69500	24.09800	24.52400	24.91200	25.19100	25.3200	25.27700	25.06600	24.71800	24.28800	23.84600	23.45100	23.12800	22.88900	22.72700	22.60600	22.4930	22.37200	22.22600	22.03600	21.80200	21.52600	21.21400	20.88100	20.54500	20.22900	19.94500	19.70200	19.49900	19.32800	19.17500	19.02600	18.87900	18.73500	18.59600	18.46900	18.355	18.252	18.15400	18.05500	17.9490	17.83100	17.69800	17.54800	17.3830	17.20500
Armenia	ARM	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	38.50900	37.21700	35.5610	33.62900	31.54000	29.4470	27.49600	25.78600	24.38900	23.35500	22.69400	22.36600	22.26200	22.28300	22.38200	22.53000	22.72100	22.9610	23.24300	23.53800	23.82300	24.08200	24.29700	24.44600	24.50200	24.43100	24.20700	23.82600	23.2920	22.60900	21.76800	20.76400	19.62700	18.41000	17.17200	15.99100	14.94700	14.09200	13.45700	13.06200	12.91000	12.98000	13.20800	13.52400	13.88400	14.24400	14.57200	14.85500	15.080	15.233	15.30400	15.29700	15.2320	15.12500	14.97900	14.79500	14.5680	14.29800
American Samoa	ASM	Birth rate, crude (per 1,000 people)	SP.DYN.CBRT.IN	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	21.60000	NA	NA	NA	18.70000	NA	NA	NA	17.50000	NA	NA	NA

Tidy Birthrate Data

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_birthrate == FALSE) {
  
  # Convert year columns into a variable
  birthrate_df <- birthrate_df %>% 
    select(Country, `2000`, `2005`, `2010`, `2015`) %>%
    gather(Year, BirthRate, -Country) %>% 
    arrange(Country, Year) %>% 
    inner_join(iso3_codes)

  # Fix datatypes
  birthrate_df$Year <- as.numeric(birthrate_df$Year)

  isTidy_birthrate <- TRUE # if this cell ir rerun, don't try to re-tidy
  
} else {
  print('Already processed, nothing to do.')
}

knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data.  Tidy data, first 10 rows shown.  Births per 1000 people.")

Birthrate Data. Tidy data, first 10 rows shown. Births per 1000 people.
Country	Year	BirthRate	ISO3	ISO2
Afghanistan	2000	48.021	AFG	AF
Afghanistan	2005	44.723	AFG	AF
Afghanistan	2010	39.829	AFG	AF
Afghanistan	2015	34.809	AFG	AF
Albania	2000	16.436	ALB	AL
Albania	2005	12.821	ALB	AL
Albania	2010	12.001	ALB	AL
Albania	2015	12.197	ALB	AL
Algeria	2000	19.554	DZA	DZ
Algeria	2005	20.774	DZA	DZ

Load Maternal Mortality Data

THis is the primary dataset which everything else is linked with. Here we see the maternal mortality rate provided as deaths per 100,000 live births broken out by country and year.

# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'https://data.unicef.org/wp-content/uploads/2015/11/MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
raw_fn            <- 'MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
cache_fn          <- 'mmr_data.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 4  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 17  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','2000','2005','2010','2015', '2017')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn    <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the dataset
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)

## [1] "Found cached copy: data/cache/mmr_data.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

mmr_df <- data_df
isTidy_mmr   <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data.  Count of maternal deaths per 100,000 live births.  Raw data, first 10 rows shown.  (Source: https://data.unicef.org/resources/dataset/maternal-health-data/)")

Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Raw data, first 10 rows shown. (Source: https://data.unicef.org/resources/dataset/maternal-health-data/)
ISO3	Country	2000	2005	2010	2015	2017
AFG	Afghanistan	1450	1140	954	701	638
ALB	Albania	23	22	21	15	15
DZA	Algeria	161	127	115	114	112
AGO	Angola	827	519	326	251	241
ATG	Antigua and Barbuda	44	40	44	43	42
ARG	Argentina	66	59	51	41	39
ARM	Armenia	43	35	32	28	26
AUS	Australia	7	5	5	6	6
AUT	Austria	6	6	5	5	5
AZE	Azerbaijan	47	42	31	27	26

Tidy MMR Dataset

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_mmr == FALSE) {
  # Convert year columns into a variable
  mmr_df <- mmr_df %>% 
    gather(Year, MMR, -ISO3, -Country)

  # Fix datatypes
  mmr_df$Year <- as.numeric(mmr_df$Year)

  # Filter years  
  mmr_df <- mmr_df %>% filter(Year <= 2015) %>% arrange(Country, Year)

  isTidy_mmr <- TRUE # if this cell ir rerun, don't try to re-tidy
  
} else {
  print('Already processed, nothing to do.')
}

knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data.  Count of maternal deaths per 100,000 live births.  Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015.")

Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015.
ISO3	Country	Year	MMR
AFG	Afghanistan	2000	1450
AFG	Afghanistan	2005	1140
AFG	Afghanistan	2010	954
AFG	Afghanistan	2015	701
ALB	Albania	2000	23
ALB	Albania	2005	22
ALB	Albania	2010	21
ALB	Albania	2015	15
DZA	Algeria	2000	161
DZA	Algeria	2005	127

Load Skilled Attendant Data

UNICEF provides country level data describing a number of different indicators related to pregnancy and births. The downloaded raw dataset includes a number of different indicator tabs, “Skilled Attendant at Birth” (SAB) being only one of many. I’m curious how the presence of a trainined attendant might correlate with the more central question of maternal mortality during birth. UNICEF don’t report for all countries, only a subset, so that will limit my analysis to specific countries.

# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn            <- 'maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
cache_fn          <- 'sab_indicators.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 8  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 9  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 5      # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','Year','Total','Age_1517','Age_1819', 'Age_lt20', 'Age_gt20', 'Age_2034', 'Age_3549', 'Source', 'SourceYear')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn    <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the dataset
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)

## [1] "Found cached copy: data/cache/sab_indicators.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

sab_df   <- data_df
isTidy_sab <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth.  Raw data, first 10 rows shown.  (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)")

Skilled Attendant at Birth. Raw data, first 10 rows shown. (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)
ISO3	Country	Year	Total	Age_1517	Age_1819	Age_lt20	Age_gt20	Age_2034	Age_3549	Source	SourceYear
AFG	Afghanistan	2010	39.00000	34	35	34	37	37	37	MICS	2010
ALB	Albania	2000	99.12912	100	100	100	99.0734476224367	98.9604208782254	100	MICS	2000
ALB	Albania	2005	99.75600	100	100	100	99.7308564178849	99.7006924128754	100	MICS	2005
ALB	Albania	2008	99.43894	93.9805367626799	100	98.198223919787	99.5253293235263	99.5189609784249	99.6050081410941	DHS	2008
ARM	Armenia	2000	97.17686	96.1092640141004	97.3167730215472	97.112296301636	97.1883744086192	97.4217674005781	93.8763280483474	DHS	2000
ARM	Armenia	2005	98.58195	100	100	100	98.4219872151632	98.3336271503564	100	DHS	2005
ARM	Armenia	2010	100.00000	100	100	100	100	100	100	DHS	2010
AZE	Azerbaijan	2000	87.46867	100	91.6666666666667	94	86.5497076023392	87.3817034700316	76	MICS	2000
AZE	Azerbaijan	2006	89.58261	96.3219109387315	96.007394399452	96.0907370879405	88.9887563923704	89.1024951119957	87.6490910432445	DHS	2006
BGD	Bangladesh	2004	15.53558	14.659160649885	17.0460969662523	15.7557293822027	15.4272774340049	15.9158543553316	9.65134903482117	DHS	2004

Tidy SAB Dataset

isTidy_sab   <- FALSE

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_sab == FALSE) {
  # While this is a wide dataset and it's trivial to gather() the age data, for my questions, I really don't need age.  Since my MMR, Population and NHA data do not provide Age breakouts, I will only focus on the Total percentage.
  sab_df <- sab_df %>% select(ISO3, Country, Year, Total, Source)
  
  # Fix datatypes
  sab_df$Year <- as.numeric(sab_df$Year)
  sab_df$Total <- as.numeric(sab_df$Total)
  
  # Since there are two different sources of data, we might have a case where they report for the same year.  I'm going to group_by() and average the totals in the case where we have multiple data points.
  sab_df <- sab_df %>% group_by(ISO3, Year) %>% summarize(SAB_Pct = mean(Total))
  sab_df <- sab_df %>% filter(Year == 2000 | Year == 2005 | Year == 2010 | Year == 2015)
  
  isTidy_sab <- TRUE # if this cell ir rerun, don't try to re-tidy

} else {
  print('Already processed, nothing to do.')
}

knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth.  Tidy data, first 10 rows shown.")

Skilled Attendant at Birth. Tidy data, first 10 rows shown.
ISO3	Year	SAB_Pct
AFG	2010	39.00000
ALB	2000	99.12912
ALB	2005	99.75600
ARM	2000	97.17686
ARM	2005	98.58195
ARM	2010	100.00000
AZE	2000	87.46867
BDI	2000	25.17007
BDI	2005	33.62995
BDI	2010	67.18792

Load Health Expenditures Data

WHO provides on online tool to export data from WHO member countries on a variety of meansures. After exploring options, I honed in on Health Expenditures. I am curious if spending on Health has any noticable correlation with maternal mortality outcomes. I manually downloaded the NHA_indicators.xlsx from their online tool.

reference: http://apps.who.int/nha/database/Home/Index/en source: http://apps.who.int/nha/database/ViewData/Indicators/en

# reference: http://apps.who.int/nha/database/Home/Index/en
# source: http://apps.who.int/nha/database/ViewData/Indicators/en

# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
#source_url        <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn            <- 'NHA_indicators.xlsx'
cache_fn          <- 'nha_indicators.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 2  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 0  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('Country','Indicator','Note','Year_2000','Year_2005','Year_2010', 'Year_2015')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn    <- paste('data/cache/', cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Since data came from an online tool, we don't have a specific file we can download
  msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}

print(msg)

## [1] "Found cached copy: data/cache/nha_indicators.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

nha_df <- data_df
isTidy_nha   <- FALSE

print(msg)

## [1] "Cached CSV data loaded."

knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Health Expenditures Data.  Raw data, first 10 rows shown.  (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)")

WHO Health Expenditures Data. Raw data, first 10 rows shown. (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)
Country	Indicator	Note	Year_2000	Year_2005	Year_2010	Year2015
Algeria	Current Health Expenditure (CHE) per Capita in PPP	NA	2.823850e+02	354.95766	645.28783	1020.54641
Algeria	Gross Domestic Product	in million current PPP	2.523851e+05	365236.38917	455452.36606	582720.55376
Algeria	Purchasing Power Parity (NCU per Int$)	NA	1.633813e+01	20.70440	26.32891	28.66228
Algeria	Population (in thousands)	NA	3.118366e+04	33288.43700	36117.63700	39871.52800
Algeria	Female Population 15-49	NA	1.206122e+04	12739.57200	13204.20100	14043.06800
Angola	Current Health Expenditure (CHE) per Capita in PPP	NA	6.492660e+01	131.33186	158.17301	195.75735
Angola	Gross Domestic Product	in million current PPP	4.572640e+04	68754.61066	135067.31639	184842.95978
Angola	Purchasing Power Parity (NCU per Int$)	NA	2.004662e+00	35.79142	56.15439	66.65537
Angola	Population (in thousands)	NA	1.644092e+04	19552.54200	23369.13100	27859.30500
Angola	Female Population 15-49	NA	6.121979e+03	7243.34500	8637.37000	10368.29500

Tidy NHA Dataset

# If rerunning our RMarkdown, only tidy if necessary
if (isTidy_nha == FALSE) {
  # Remove to extra Note column we won't need
  nha_df <- nha_df %>% select(-Note)
  
  # Convert year columns into a variable
  nha_df <- nha_df %>%
    gather(Year, Amount, -Country, -Indicator) %>% 
    mutate(Year = stringr::str_replace(Year, "Year_", "")) %>% 
    drop_na()

  # Fix datatypes
  nha_df$Year <- as.numeric(nha_df$Year)
  nha_df$Amount <- as.numeric(nha_df$Amount)
  
  # Break out CHE into a separate column
  exp_data <- nha_df %>% filter(Indicator == 'Current Health Expenditure (CHE) per Capita in PPP') %>% 
    mutate(CHE = Amount) %>%
    select(-Indicator, -Amount) %>%
    arrange(Country, Year) %>% 
    drop_na()
  
  # Break out GDP into a separate column
  gdp_data <- nha_df %>% filter(Indicator == 'Gross Domestic Product') %>% 
    mutate(GDP = Amount) %>%
    select(-Indicator, -Amount) %>%    
    arrange(Country, Year) %>% 
    drop_na()
  
  # Join the columns back into a table based on Country & Year
  nha_df <- gdp_data %>% 
    left_join(exp_data) %>%
    left_join(iso3_codes)

  isTidy_nha <- TRUE # if this cell ir rerun, don't try to re-tidy
  
} else {
  print('Already processed, nothing to do.')
}

knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Reported Health Expernditures & GDP by Country & Year.  Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country.  This is also sometimes refered to as the 'Big Mac Index'")

WHO Reported Health Expernditures & GDP by Country & Year. Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country. This is also sometimes refered to as the ‘Big Mac Index’
Country	Year	GDP	CHE	ISO3	ISO2
Afghanistan	2005	24851.345	98.61208	AFG	AF
Afghanistan	2010	44457.152	132.27128	AFG	AF
Albania	2000	11925.958	258.28327	ALB	AL
Albania	2005	17663.350	363.84904	ALB	AL
Albania	2010	28073.799	478.31474	ALB	AL
Algeria	2000	252385.066	282.38500	DZA	DZ
Algeria	2005	365236.389	354.95766	DZA	DZ
Algeria	2010	455452.366	645.28783	DZA	DZ
Andorra	2000	2132.781	3049.00618	AND	AD
Andorra	2005	3444.711	3741.40393	AND	AD

Analysis

Note: PPP refers to “Price Point Parity”. This country-by-country adjustment reflects different purchasing power in different geos and attempts to normalize spending for better comparison across geos. We know intuititely that cost of living and items vary greatly around the world and $1 in the US doesn’t NOT have the same purchasing power as $1 in India. PPP normalizes so that spending can be compared across countries.

Join Tables

Note that we are combining data from a number of sources and years. As such there wil be a number of NA’s where data is missing in various columns. Here I also calculate the actual number of maternal deaths based on populations, birth rates and MMR rates.

# Use the current population by year * birthrate by year to get actual births
births_df <- birthrate_df %>%
  inner_join(population_df) %>%
  mutate(Births = round(Population / 1000 * BirthRate)) %>% 
  select(ISO3, Country, Year, Population, BirthRate, Births)
births_df <- births_df %>% left_join(iso3_codes)

# now that we have births, we can multiply by MMR (deaths per 100,000 births)
mmr_df <- mmr_df %>% 
  left_join(births_df) %>%
  mutate(Deaths = round(Births / 10000 * MMR))
  
# join in Skilled Attendant at Birth (SAB)
mmr_df <- mmr_df %>%
  left_join(sab_df) %>%
  left_join(nha_df)

knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortaility table with supporting columns.")

Maternal Mortaility table with supporting columns.
ISO3	Country	Year	MMR	Population	BirthRate	Births	ISO2	Deaths	SAB_Pct	GDP	CHE
AFG	Afghanistan	2000	1450	22462000	48.021	1078648	AF	156404	NA	NA	NA
AFG	Afghanistan	2005	1140	26335000	44.723	1177780	AF	134267	NA	24851.34	98.61208
AFG	Afghanistan	2010	954	29121000	39.829	1159860	AF	110651	39.00000	44457.15	132.27128
AFG	Afghanistan	2015	701	32565000	34.809	1133555	AF	79462	NA	NA	NA
ALB	Albania	2000	23	3159000	16.436	51921	AL	119	99.12912	11925.96	258.28327
ALB	Albania	2005	22	3025000	12.821	38784	AL	85	99.75600	17663.35	363.84904
ALB	Albania	2010	21	2987000	12.001	35847	AL	75	NA	28073.80	478.31474
ALB	Albania	2015	15	3030000	12.197	36957	AL	55	NA	NA	NA
DZA	Algeria	2000	161	30639000	19.554	599115	DZ	9646	NA	252385.07	282.38500
DZA	Algeria	2005	127	32918000	20.774	683839	DZ	8685	NA	365236.39	354.95766

MMR from 2000 to 2015

We generally see a decline in mortality rates from 2000 to 2015 data.

data <- mmr_df %>% filter(Year==2000 | Year==2015) %>% select(MMR, Year)
data$Year <- as.factor(data$Year)

ggplot(data=data, aes(x=MMR, color=Year)) +
  geom_histogram(fill='white', alpha=0.2, position="identity") +
  scale_color_discrete(name="Year", labels=c(2000, 2015))

Changes in Mortality by Country

Let’s see which countries have the most reduction in maternal deaths.

# Group data by country and calculate the delta in deaths over the years
tbl <- mmr_df %>% 
  group_by(Country) %>% 
  summarize(delta = round(first(Deaths) - last(Deaths))) %>%
  arrange(desc(delta)) %>%
  drop_na()

# barplot
ggplot(tbl %>% top_n(10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
  ggtitle("Top 10 Countries with Reductions in Deaths") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = -0.5, vjust = 0.5 ) +
  scale_y_discrete(labels = comma, breaks=c(1000, 100000, 200000, 600000)) +
  ylim(c(0,700000)) +
  xlab('Country') + 
  ylab('Deaths') +
  theme(legend.position = "none") + 
  coord_flip()

… and those with increases in maternal deaths. Nigeria is clearly experiencing a crisis.

# barplot
ggplot(tbl %>% top_n(-10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
  ggtitle("Bottom 10 Countries with Increases in Deaths") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.5, vjust = 0.5 ) +
  scale_y_discrete(labels = comma, breaks=c(-30000, -20000, -1000)) +
  ylim(c(-25000,0)) +
  xlab('Country') + 
  ylab('Deaths') +
  theme(legend.position = "none") +
  coord_flip()

# Group data by country and calculate the delta in deaths over the years
mmr_df <- mmr_df %>% 
  group_by(Country) %>% 
  mutate(delta = Deaths - lag(Deaths, order_by=Year, default=first(Deaths)))

Health Expenditures vs GDP

We see that generally, countries spend less than $4000 (PPP) adjusted dollars per person on health expenditures and as GDP increases, there is an increase in CHE. We some clear outliers with very high GDP countries spending very little and vice versa, countries with relatively low GDP spending reporting quite high CHE. While this is an interesting pattern, I suspect that CHE is not entriely reported nor being spent on the poorest population that might be more impacted by MMR. Unfortunaely, I don’t have the dataset to fully explore those questions.

ggplot(mmr_df, aes(GDP, CHE)) +
  geom_point() +
  labs(title='Country Health Expenditure (PPP) vs Country GDP (PPP)') +
  xlab('GDP (Millions, PPP)') +
  ylab('CHE (Dollars, PPP)')

MMR vs CHE

We do see a pattern where higher reported health expenditures correlate with lower maternal mortality and inversely, those countries (and years) with lower CHE see higher MMR.

ggplot(mmr_df, aes(CHE, MMR)) +
  geom_point() +
  labs(title='Maternal Mortality Rate (MMR) vs Country Health Expenditure (PPP)') +
  xlab('CHE (Dollars, PPP)') +
  ylab('MMR (per 100k live births)')

Maternal Deaths vs CHE

The death count increases most dramatically with lower health expenditures and population impacts.

ggplot(mmr_df, aes(CHE, Deaths)) +
  geom_point() +
  labs(title='Maternal Deaths vs Country Health Expenditures (PPP)') +
  xlab('CHE (Dollars, PPP)') +
  ylab('Maternal Deaths')

MMR vs GDP

Generally the Maternal Mortality rate is higher with counties having lower GDP.

ggplot(mmr_df, aes(GDP, MMR)) +
  geom_point() +
  labs(title='Maternal Mortality Rate (MMR) vs Country GDP (Millions, PPP)') +
  xlab('GDP (Millions, PPP)') +
  ylab('MMR (per 100k live births)')

Skilled Attendants and MMR

While we don’t have many datapoints, there does appear to be a trend with lowered materal mortality when there is a higher percentage of births managed by a health professional. That said, a Skilled Attendant at birth may only be able to do so much … there may be affects earlier in pregnancy leading to mortality.

ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, MMR)) +
  geom_point() +
  labs(title='Maternal Mortality Rate vs Skilled Attendants at Birth (SAB)', 
       caption='note, we only have 98 datapoints with both SAB and MMR')+
  xlab('Skilled Attendant at Birth (%)') +
  ylab('MMR (per 100k live births)')

Skilled Attendants and Deaths

As with mortality rate, we generally see fewer reported deaths as the percent of births are handled by a trained health professional.

ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, Deaths)) +
  geom_point() + 
  scale_y_continuous(labels = comma) +
  ylim(0, 300000) +
  labs(title='Maternal Deaths vs Skilled Attendants at Birth (SAB)') +
  xlab('Skilled Attendant at Birth (%)') +
  ylab('Maternal Deaths')

Summary

I started this project with a single core dataset with MMR by country and year. As I tried to answer questions, I found I needed additional data which lead to chasing down populations, birthrates and country codes. Only then did I seek out the Skilled Attendants and Health Spending data to look for patterns. Worldwide, the reported maternal mortality rate has dropped by ~40% which broadly shows up in my histograms. While the overall mortality rate is decreasing, some countries are greatly improving (e.g. India) while others have more deaths (e.g. Nigeria). Looking at health spend, GDP and presence of Skilled Attendants during birth, the later had the most notable correlation, but data was scarce. The NGO I worked with was targeting training of health workers, and from this quick survey of data, that is likely a fruitful approach to the problem.