DATA606 Final Project

Source files: https://github.com/djlofland/DATA606_F2019/tree/master/FinalProject

knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, warning = FALSE)

library(tidyverse)
library(formattable) # Pretty print tables
library(scales)
library(caret)
library(corrr)
library(Hmisc)
library(corrplot)
library(glmnet)
library(choroplethr)
library(maps)
library(mgcv)


#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  par(mfrow = c(1, 2))
  hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
       xlab = "Residuals",col="lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"), col="blue", lwd=3)
  curve(dnorm(x, 
              mean=mean(model[["residuals"]]), 
              sd=sd(model[["residuals"]])), 
        col="red", lwd=3, lty="dotted", add=T)
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
  par(mfrow = c(1, 1))
}


#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
logNormalPlot <- function(data) {
  par(mfrow = c(1, 2))
  hist(data, freq = FALSE, breaks = "fd", main = "Log Transformed",
       xlab = "Log Values", col="lightgreen")
  lines(density(data, kernel = "ep"), col="blue", lwd=3)
  curve(dnorm(x, mean=mean(data), sd=sd(data)), 
        col="red", lwd=3, lty="dotted", add=T)
  qqnorm(data, main = "Log Transformed Q-Q plot")
  qqline(data, col="red", lwd=3, lty="dotted")
  par(mfrow = c(1, 1))
}

Part 1 - Introduction

In the late 1990’s I worked for an NGO that focused on training doctors, nurses and midwives on women reproductive health issues and health during pregnancy. The NGO developed and tested training programs in very rural areas around the world to teach health care providers how to provide care for women with no easy access to primary health care facilities or doctors.

UNICEF provides data sets with data and trends in both maternal health and outcomes along with indicators by country.

For an understanding of the problem space, see Maternal mortality.

Some key facts they provide:

Every day in 2017, approximately 810 women died from preventable causes related to pregnancy and childbirth.
Between 2000 and 2017, the maternal mortality ratio (MMR, number of maternal deaths per 100,000 live births) dropped by about 38% worldwide.
94% of all maternal deaths occur in low and lower middle-income countries.
Young adolescents (ages 10-14) face a higher risk of complications and death as a result of pregnancy than other women.

I will consider Maternal Mortality Rate by country through the features: GDP, Population, Births, Health Spending, and Skilled Attendants during Birth. While factors like GDP and Population are largely out of the control of individual countries, Birthrate, Health Spending, and Skilled Staff at Birth are levers a country could influence through policies and education if these were found to correlate with Maternal Mortality Rates. Note this is an observational study looking for correlations, not causation. Features found to have significant correlation would be candidates for follow-up experimental studies.

Part 2 - Data

The primary data sets I’ll explore are:

Each of these data sets required Tidy processing to clean up. I also join the data where possible by country and year to tease out additional patterns not available within any one data set.

To facilitate joining tables using different Country label conventions, I scraped a look-up table with country names and ISO3 codes which I join to my various data sets.

World ISO3 Country Codes

These data sets explore different years, but have broad overlap between 2000-2015 (every 5 years). My analysis will be somewhat limited by the available data. If I were doing a deeper dive, I would look for more complete data and/or limit my questions to those countries that have a richer set of data. My suspicion is that better country data and reporting may indicate more attention to tackling problems with maternal mortality and consequently, better data might come from countries with better reporting and more on-the-ground efforts. In cases where we have null values (typically introduced when joining different data sets), I apply a drop_na() and limit analysis to available data.

All data sets are downloaded from primary internet sources and cached locally to ease repeat analysis.

cache_folder <- './data/cache/'   # path where we cache csv data files
raw_folder <- './data/raw/'       # path wher we cache raw data files we download

Load Country Codes

During initial research to locate data sets, I found that different data sets included ISO2 (Alpha 2), ISO3 (Alpha 3) and/or descriptive country names. Just in case it’s needed, we need a look-up table to join data sets using different country name conventions.

url             <- "https://www.iban.com/country-codes"
cache_fn        <- 'country_codes.csv'

isCacheFound    <- FALSE
local_cache_fn  <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
}

# Cached ata file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(local_cache_fn, col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else {
  # Load Raw Data
  data_df <- url %>%
    xml2::read_html() %>%
    html_nodes(xpath='//*[@id="myTable"]') %>%
    html_table()
  
  data_df <- data_df[[1]]
  
  # Cache the processed as CSV for future
  write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}

iso3_codes <- data_df

# Convert year columns into a variable
iso3_codes <- iso3_codes %>% 
  mutate(ISO3 = `Alpha-3 code`, ISO2 = `Alpha-2 code`) %>%
  select(Country, ISO3, ISO2)

print(msg)
## [1] "Cached CSV data loaded."

knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes.  Tidy data, first 10 rows shown.  (Source: http://iban.com/country-codes)")

ISO2 and ISO3 Country Codes. Tidy data, first 10 rows shown. (Source: http://iban.com/country-codes)
Country	ISO3	ISO2
Afghanistan	AFG	AF
Åland Islands	ALA	AX
Albania	ALB	AL
Algeria	DZA	DZ
American Samoa	ASM	AS
Andorra	AND	AD
Angola	AGO	AO
Anguilla	AIA	AI
Antarctica	ATA	AQ
Antigua and Barbuda	ATG	AG

Load Population Data

Changes in country population might change how we interpret maternal mortality rate. While the rate should theoretically be independent from population (by definition MMR is maternal deaths per 100,000 live births), if there is a large increase or decrease in population, this might impact conditions around pregnancy (nutrition, access to health care, etc) and/or conditions at time of birth (e.g. access to health care). Since MMR is “deaths per live births”, I’ll end up needing population data to help calculate the number of live births.

url        <- "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population"
cache_fn   <- 'population.csv'

isCacheFound <- FALSE
local_cache_fn    <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
}

# Cached ata file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(local_cache_fn, col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else {
  # Load Raw Data
  data_df <- url %>%
    xml2::read_html() %>%
    html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
    html_table()
  
  data_df <- data_df[[1]]
  
  # Cache the processed as CSV for future
  write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}

population_df <- data_df

# Convert year columns into a variable
population_df <- population_df %>% 
  mutate(Country = `Country (or dependent territory)`) %>%
  select(Country, `2000`, `2005`, `2010`, `2015`) %>%
  gather(Year, Population, -Country)

# Fix datatypes
population_df$Year <- as.numeric(population_df$Year)

# Filter years  
population_df <- population_df %>% filter(Year >= 2000 & Year <= 2015) %>% arrange(Country, Year)
population_df$Population <- population_df$Population* 1000
population_df <- population_df %>% left_join(iso3_codes)

print(msg)
## [1] "Cached CSV data loaded."

knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data.  Tidy data, first 10 rows shown.  (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)")

Population Data. Tidy data, first 10 rows shown. (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)
Country	Year	Population	ISO3	ISO2
Afghanistan	2000	22462000	AFG	AF
Afghanistan	2005	26335000	AFG	AF
Afghanistan	2010	29121000	AFG	AF
Afghanistan	2015	32565000	AFG	AF
Albania	2000	3159000	ALB	AL
Albania	2005	3025000	ALB	AL
Albania	2010	2987000	ALB	AL
Albania	2015	3030000	ALB	AL
Algeria	2000	30639000	DZA	DZ
Algeria	2005	32918000	DZA	DZ

Load Birthrate Data

While there is variation between countries, the overall birthrate worldwide has declined. To understand changes in maternal mortality, we might need to estimate how many births are occurring during years of interest in each country. It’s possible that changes in birth rates could have an impact on MMR.


# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=excel'
raw_fn            <- 'API_SP.DYN.CBRT.IN_DS2_en_excel_v2_248743.xlsx'
cache_fn          <- 'births.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 3  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 0  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('Country', 'ISO3', 'Indicator', 'Indicator_Code', as.character(seq(1960, 2017, 1)))

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn    <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the data set
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)
## [1] "Found cached copy: ./data/cache/births.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

birthrate_df   <- data_df

# Convert year columns into a variable
birthrate_df <- birthrate_df %>% 
  select(Country, `2000`, `2005`, `2010`, `2015`) %>%
  gather(Year, BirthRate, -Country) %>% 
  arrange(Country, Year) %>% 
  inner_join(iso3_codes)

# Fix datatypes
birthrate_df$Year <- as.numeric(birthrate_df$Year)

print(msg)
## [1] "Cached CSV data loaded."

knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data.  Tidy data, first 10 rows shown.  Births per 1000 people.  (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)")

Birthrate Data. Tidy data, first 10 rows shown. Births per 1000 people. (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)
Country	Year	BirthRate	ISO3	ISO2
Afghanistan	2000	48.021	AFG	AF
Afghanistan	2005	44.723	AFG	AF
Afghanistan	2010	39.829	AFG	AF
Afghanistan	2015	34.809	AFG	AF
Albania	2000	16.436	ALB	AL
Albania	2005	12.821	ALB	AL
Albania	2010	12.001	ALB	AL
Albania	2015	12.197	ALB	AL
Algeria	2000	19.554	DZA	DZ
Algeria	2005	20.774	DZA	DZ

Load Maternal Mortality Data

This is the primary data set to which everything else will be linked. Here we see the maternal mortality rate provided as deaths per 100,000 live births broken out by country and year.


# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'https://data.unicef.org/wp-content/uploads/2015/11/MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
raw_fn            <- 'MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
cache_fn          <- 'mmr_data.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 4  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 17  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','2000','2005','2010','2015', '2017')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn    <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the data set
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)
## [1] "Found cached copy: ./data/cache/mmr_data.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

mmr_df <- data_df

# Convert year columns into a variable
mmr_df <- mmr_df %>% 
  gather(Year, MMR, -ISO3, -Country)

# Fix datatypes
mmr_df$Year <- as.numeric(mmr_df$Year)

# Filter years  
mmr_df <- mmr_df %>% filter(Year <= 2015) %>% arrange(Country, Year)

print(msg)
## [1] "Cached CSV data loaded."

knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data.  Count of maternal deaths per 100,000 live births.  Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015.  (Source: https://data.unicef.org/resources/data set/maternal-health-data/)")

Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015. (Source: https://data.unicef.org/resources/data set/maternal-health-data/)
ISO3	Country	Year	MMR
AFG	Afghanistan	2000	1450
AFG	Afghanistan	2005	1140
AFG	Afghanistan	2010	954
AFG	Afghanistan	2015	701
ALB	Albania	2000	23
ALB	Albania	2005	22
ALB	Albania	2010	21
ALB	Albania	2015	15
DZA	Algeria	2000	161
DZA	Algeria	2005	127

Load Skilled Attendant Data

UNICEF provides country level data describing a number of different indicators related to pregnancy and births. The downloaded raw data set (Excel file) includes a number of different indicator tabs, “Skilled Attendant at Birth” (SAB) being only one of many. We may be interested in how the presence of a trained attendant might correlate with the more central question of maternal mortality during birth. UNICEF don’t report for all countries, only a subset, so that may limit analysis to specific countries.


# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
source_url        <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn            <- 'maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
cache_fn          <- 'sab_indicators.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 8  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 9  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 5      # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','Year','Total','Age_1517','Age_1819', 'Age_lt20', 'Age_gt20', 'Age_2034', 'Age_3549', 'Source', 'SourceYear')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn    <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Attempt to download the data set
  download.file(source_url, local_raw_fn, method="auto")
  
  if(file.size(local_raw_fn) > 0) {
    msg <- paste('Downloaded file from: ', source_url, sep='')
    isRawFound <- TRUE
  } else {
    msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
  }
}

print(msg)
## [1] "Found cached copy: ./data/cache/sab_indicators.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

sab_df   <- data_df

# While this is a wide data set and it's trivial to gather() the age data, for my questions, I really don't need age.  Since my MMR, Population and NHA data do not provide Age breakouts, I will only focus on the Total percentage.
sab_df <- sab_df %>% select(ISO3, Country, Year, Total, Source)

# Fix datatypes
sab_df$Year <- as.numeric(sab_df$Year)
sab_df$Total <- as.numeric(sab_df$Total)

# Since there are two different sources of data, we might have a case where they report for the same year.  I'm going to group_by() and average the totals in the case where we have multiple data points.
sab_df <- sab_df %>% 
  group_by(ISO3, Year) %>% 
  dplyr::summarize(SAB_Pct = mean(Total, na.rm = TRUE))
sab_df <- sab_df %>% 
  filter(Year == 2000 | Year == 2005 | Year == 2010 | Year == 2015)

print(msg)
## [1] "Cached CSV data loaded."

knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth.  Tidy data, first 10 rows shown.  (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)")

Skilled Attendant at Birth. Tidy data, first 10 rows shown. (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)
ISO3	Year	SAB_Pct
AFG	2010	39.00000
ALB	2000	99.12912
ALB	2005	99.75600
ARM	2000	97.17686
ARM	2005	98.58195
ARM	2010	100.00000
AZE	2000	87.46867
BDI	2000	25.17007
BDI	2005	33.62995
BDI	2010	67.18792

Load Health Expenditures Data

WHO provides on online tool to export data from WHO member countries on a variety of measures. After exploring options, I honed in on Health Expenditures. We may be curious whether spending on Health has any noticeable correlation with maternal mortality outcomes. I manually downloaded the NHA_indicators.xlsx from their online tool.

reference: http://apps.who.int/nha/database/Home/Index/en source: http://apps.who.int/nha/database/ViewData/Indicators/en

# reference: http://apps.who.int/nha/database/Home/Index/en
# source: http://apps.who.int/nha/database/ViewData/Indicators/en

# We first want to check if the data has already been loaded and muched  as a local cached .csv file.  If the csv file is available, use that.  If 
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.

# Source Data File
#source_url        <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn            <- 'NHA_indicators.xlsx'
cache_fn          <- 'nha_indicators.csv'

# Which WorkSheet to load and there are some extra rows above & below our table of interest. 
data_header_rows  <- 2  # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows    <- 0  # rows at bottom to skip
data_header       <- FALSE
data_sheet        <- 1      # optional for Excel Workbooks
data_column_names <- c('Country','Indicator','Note','Year_2000','Year_2005','Year_2010', 'Year_2015')

# Some variables to help with flow control
isCacheFound      <- FALSE
isRawFound        <- FALSE
isDataLoaded      <- FALSE

local_raw_fn      <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn    <- paste(cache_folder, cache_fn, sep = '')

# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
  msg <- paste('Found cached copy: ', local_cache_fn, sep='')
  isCacheFound <- TRUE
  
} else if(file.exists(local_raw_fn)) {
  msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
  isRawFound <- TRUE
  
} else {
  # Since data came from an online tool, we don't have a specific file we can download
  msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}

print(msg)
## [1] "Found cached copy: ./data/cache/nha_indicators.csv"

# Cached data file found
if (isCacheFound) {
  # Load CSV Files
  data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  isDataLoaded <- TRUE
  msg <- 'Cached CSV data loaded.'

} else if (isRawFound) {
  # Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
  if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
    data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
    isDataLoaded <- TRUE
    msg <- 'Raw data loaded.'
    
    ##### DATA PROCESSING start #####
    
    # Are there any extraneous row at to top to remove?
    if (data_header_rows > 0) {
      data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
      data_header_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Did the raw file have any extraneous rows at the bottom to remove?
    if (data_tail_rows > 0) {
      last_row <- nrow(data_df)-data_tail_rows
      data_df <- data_df[0:last_row,]
      data_tail_rows <- 0
      rownames(data_df) <- NULL
    }
    
    # Remove any empty columns in the raw data
    data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
    
    # Set the Column names
    if (length(data_column_names) > 0) {
      names(data_df) <- data_column_names
      data_column_names <- list()
    }
    
    # Convert to a tibble
    data_df <- as_tibble(data_df)
    
    ##### DATA PROCESSING done #####
    
    # Cache the processed as CSV for future
    write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
    msg <- 'Raw data loaded, processed and saved to cache.'
    
    data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
  }
} else {
  # Bummer - no load data file was found to load
  msg <- 'Data NOT loaded.'
}

nha_df <- data_df

print(msg)
## [1] "Cached CSV data loaded."

# Remove to extra Note column we won't need
nha_df <- nha_df %>% select(-Note)

# Convert year columns into a variable
nha_df <- nha_df %>%
  gather(Year, Amount, -Country, -Indicator) %>% 
  mutate(Year = stringr::str_replace(Year, "Year_", "")) %>% 
  drop_na()

# Fix datatypes
nha_df$Year <- as.numeric(nha_df$Year)
nha_df$Amount <- as.numeric(nha_df$Amount)

# Break out CHE into a separate column
exp_data <- nha_df %>% filter(Indicator == 'Current Health Expenditure (CHE) per Capita in PPP') %>% 
  mutate(CHE = Amount) %>%
  select(-Indicator, -Amount) %>%
  arrange(Country, Year) %>% 
  drop_na()

# Break out GDP into a separate column
gdp_data <- nha_df %>% filter(Indicator == 'Gross Domestic Product') %>% 
  mutate(GDP = Amount) %>%
  select(-Indicator, -Amount) %>%    
  arrange(Country, Year) %>% 
  drop_na()

# Join the columns back into a table based on Country & Year
nha_df <- gdp_data %>% 
  left_join(exp_data) %>%
  left_join(iso3_codes)

knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Reported Health Expenditures & GDP by Country & Year.  Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country.  This is also sometimes referred to as the 'Big Mac Index'  (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)")

WHO Reported Health Expenditures & GDP by Country & Year. Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country. This is also sometimes referred to as the ‘Big Mac Index’ (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)
Country	Year	GDP	CHE	ISO3	ISO2
Afghanistan	2005	24851.345	98.61208	AFG	AF
Afghanistan	2010	44457.152	132.27128	AFG	AF
Albania	2000	11925.958	258.28327	ALB	AL
Albania	2005	17663.350	363.84904	ALB	AL
Albania	2010	28073.799	478.31474	ALB	AL
Algeria	2000	252385.066	282.38500	DZA	DZ
Algeria	2005	365236.389	354.95766	DZA	DZ
Algeria	2010	455452.366	645.28783	DZA	DZ
Andorra	2000	2132.781	3049.00618	AND	AD
Andorra	2005	3444.711	3741.40393	AND	AD

Part 3 - Exploratory data analysis

Note: PPP refers to “Price Point Parity”. This country-by-country adjustment reflects different purchasing power in different geo’s and attempts to normalize spending for better comparison across geo’s. We know intuitively that cost of living and items vary greatly around the world and $1 in the US doesn’t NOT have the same purchasing power as $1 in India. PPP normalizes so that spending can be compared across countries.

Join Tables

Note that we are combining data from a number of sources and years. As such there will be a number of NA’s where data is missing in various columns. Here I also calculate the actual number of maternal deaths based on populations, birth rates and MMR rates.


# Use the current population by year * birthrate by year to get actual births
births_df <- birthrate_df %>%
  inner_join(population_df) %>%
  mutate(Births = round(Population / 1000 * BirthRate)) %>% 
  select(ISO3, Country, Year, Population, BirthRate, Births)
births_df <- births_df %>% left_join(iso3_codes)

# now that we have births, we can multiply by MMR (deaths per 100,000 births)
mmr_df <- mmr_df %>% 
  left_join(births_df) %>%
  mutate(Deaths = round(Births / 10000 * MMR))
  
# join in Skilled Attendant at Birth (SAB)
mmr_df <- mmr_df %>%
  left_join(sab_df) %>%
  left_join(nha_df)

knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality table with supporting columns.")

Maternal Mortality table with supporting columns.
ISO3	Country	Year	MMR	Population	BirthRate	Births	ISO2	Deaths	SAB_Pct	GDP	CHE
AFG	Afghanistan	2000	1450	22462000	48.021	1078648	AF	156404	NA	NA	NA
AFG	Afghanistan	2005	1140	26335000	44.723	1177780	AF	134267	NA	24851.34	98.61208
AFG	Afghanistan	2010	954	29121000	39.829	1159860	AF	110651	39.00000	44457.15	132.27128
AFG	Afghanistan	2015	701	32565000	34.809	1133555	AF	79462	NA	NA	NA
ALB	Albania	2000	23	3159000	16.436	51921	AL	119	99.12912	11925.96	258.28327
ALB	Albania	2005	22	3025000	12.821	38784	AL	85	99.75600	17663.35	363.84904
ALB	Albania	2010	21	2987000	12.001	35847	AL	75	NA	28073.80	478.31474
ALB	Albania	2015	15	3030000	12.197	36957	AL	55	NA	NA	NA
DZA	Algeria	2000	161	30639000	19.554	599115	DZ	9646	NA	252385.07	282.38500
DZA	Algeria	2005	127	32918000	20.774	683839	DZ	8685	NA	365236.39	354.95766

MMR from 2000 to 2015

We generally see a decline in mortality rates from 2000 to 2015 data. We observe in both the histogram and density plots that the weight is shifted towards lower MMR in 2015 vs 2000.

data <- mmr_df %>% filter(Year==2000 | Year==2015) %>% select(MMR, Year)
data$Year <- as.factor(data$Year)

ggplot(data=data, aes(x=MMR, color=Year)) +
  geom_histogram(fill='white', alpha=0.2, position="identity") +
  scale_color_discrete(name="Year", labels=c(2000, 2015)) + labs(title = "2000 and 2015 Maternal Mortality Rates (MMR)") +
  theme(plot.title = element_text(hjust = 0.5))


ggplot(data=data, aes(x=MMR, color=Year)) +
  geom_density(fill='white', alpha=0.2, position="identity") +
  scale_color_discrete(name="Year", labels=c(2000, 2015)) + labs(title = "2000 and 2015 Maternal Mortality Rates (MMR)") +
  theme(plot.title = element_text(hjust = 0.5))

We can visualize MMR’s on a world map to get a sense for geographical distribution and where higher or lower MMR’s are more prevalent.


data <- mmr_df %>% 
  select(Country, MMR, Year) %>%
  rename(region = Country,
         value = MMR) %>%
  mutate(region = tolower(region)) %>%
  mutate(region = recode(region,
                          "united states"    = "united states of america",
                          "congo, dem. rep." = "democratic republic of the congo",
                          "congo, rep."      = "republic of congo",
                          "korea, dem. rep." = "south korea",
                          "korea. rep."      = "north korea",
                          "tanzania"         = "united republic of tanzania",
                          "serbia"           = "republic of serbia",
                          "slovak republic"  = "slovakia",
                          "yemen, rep."      = "yemen"))

country_choropleth(data %>% filter(Year == 2000), num_colors = 9) +
  scale_fill_brewer(palette="YlOrRd") +
  labs(title = paste("MMR by country (", 2000, ")"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))


country_choropleth(data %>% filter(Year == 2005), num_colors = 9) +
  scale_fill_brewer(palette="YlOrRd") +
  labs(title = paste("MMR by country (", 2005, ")"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))


country_choropleth(data %>% filter(Year == 2010), num_colors = 9) +
  scale_fill_brewer(palette="YlOrRd") +
  labs(title = paste("MMR by country (", 2010, ")"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))


country_choropleth(data %>% filter(Year == 2015), num_colors = 9) +
  scale_fill_brewer(palette="YlOrRd") +
  labs(title = paste("MMR by country (", 2015, ")"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))


#data$Year <- as.factor(data$Year)

Changes in Mortality by Country

Let’s see which countries have the most reduction in maternal deaths by taking 2015 data and subtracting 2000 data to find the delta for each country over the 15 years. This is a rough measure of overall trends over the 15 years.

# Group data by country and calculate the delta in deaths over the years
tbl <- mmr_df %>% 
  group_by(Country) %>% 
  dplyr::summarize(delta = round(first(Deaths) - last(Deaths))) %>%
  arrange(desc(delta)) %>%
  drop_na()

# barplot
ggplot(tbl %>% top_n(10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
  ggtitle("Top 10 Countries with Reductions in Deaths") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.1, vjust = 0.5 ) +
  #scale_y_discrete(labels = comma, breaks=c(1000, 100000, 200000, 600000)) +
  ylim(c(0,700000)) +
  xlab('Country') + 
  ylab('Deaths') +
  theme(legend.position = "none") + 
  coord_flip() + 
  scale_y_continuous(labels = comma)

… and those with increases in maternal deaths. Something profound has happened in India over these years and Idia tops the list with the largest decrease in deaths. At this point, we don’t know why, but that is a very positive trend. Nigeria on the other hand is clearly experiencing a crisis.

# barplot
ggplot(tbl %>% top_n(-10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
  ggtitle("Bottom 10 Countries with Increases in Deaths") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.5, vjust = 0.5 ) +
  scale_y_discrete(labels = comma, breaks=c(-30000, -20000, -1000)) +
  ylim(c(-25000,0)) +
  xlab('Country') + 
  ylab('Deaths') +
  theme(legend.position = "none") +
  coord_flip()

data <- tbl %>% 
  filter(delta > 0) %>%
  rename(region = Country,
         value = delta) %>%
  mutate(region = tolower(region), 
         value = value) %>%
  mutate(region = recode(region,
                          "united states"    = "united states of america",
                          "congo, dem. rep." = "democratic republic of the congo",
                          "congo, rep."      = "republic of congo",
                          "korea, dem. rep." = "south korea",
                          "korea. rep."      = "north korea",
                          "tanzania"         = "united republic of tanzania",
                          "serbia"           = "republic of serbia",
                          "slovak republic"  = "slovakia",
                          "yemen, rep."      = "yemen"))

country_choropleth(data, num_colors = 9) +
  scale_fill_brewer(palette="Greens") +
  labs(title = paste("MMR Improvements (2000-2015)"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))

data <- tbl %>% 
  filter(delta < 0) %>%
  rename(region = Country,
         value = delta) %>%
  mutate(region = tolower(region), 
         value = -value) %>%
  mutate(region = recode(region,
                          "united states"    = "united states of america",
                          "congo, dem. rep." = "democratic republic of the congo",
                          "congo, rep."      = "republic of congo",
                          "korea, dem. rep." = "south korea",
                          "korea. rep."      = "north korea",
                          "tanzania"         = "united republic of tanzania",
                          "serbia"           = "republic of serbia",
                          "slovak republic"  = "slovakia",
                          "yemen, rep."      = "yemen"))

country_choropleth(data, num_colors = 9) +
  scale_fill_brewer(palette="Reds") +
  labs(title = paste("MMR Losses (2000-2015)"), fill = "value") + 
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))

# Group data by country and calculate the delta in deaths over the years
#mmr_df <- mmr_df %>% 
#  group_by(Country) %>% 
#  mutate(delta = Deaths - lag(Deaths, order_by=Year, default=first(Deaths)))

Correlation Analysis

Let’s start with a simple correlation matrix to understand which features correlate positively with MMR (i.e. increase dates of death) or negatively (decreased rates of death). Note that we are particularly interested in features in RED that negatively correlate with MMR. We see that Country Health Expenditures (CHE), Skilled Attendant at Birth (SAB_Pct) and GDP appear as promising features to explore. Births have a positive correlation – more births correlating with a higher rate of moms dying. Since MMR is a rate (deaths per 100k births), we might have assumed that the number of births would have no direct affect on the rate of women dying. Since we cannot attribute causation, we can merely say that in countries with more births, there is an increase in the rate of deaths. Maybe more births tax limited resources? This might be a different research question to explore in a later study. Interestingly, Population doesn’t seem to correlate with MMR. So higher population doesn’t seem to be a factor in explaining MMR.

mmr_data <- mmr_df %>%
  select(Population, Births, GDP, CHE, MMR, SAB_Pct) %>%
  drop_na()
mmr_cor <- cor(mmr_data, method = c("spearman"))
corrplot(mmr_cor)


palette = colorRampPalette(c("red", "white", "green")) (20)
heatmap(x = mmr_cor, col = palette, symm = TRUE)

Health Expenditures vs GDP

We see that generally, countries spend less than $4000 (PPP) adjusted dollars per person on health expenditures and as GDP increases, there is an increase in CHE. We see some clear outliers with very high GDP countries spending very little and vice versa, countries with relatively low GDP spending reporting quite high CHE. While this is an interesting pattern, I suspect that CHE is not entirely reported nor being spent on the poorest population that might be more impacted by MMR. Unfortunately, I don’t have the data set to fully explore those questions.

ggplot(mmr_df, aes(GDP, CHE)) +
  geom_point() +
  labs(title='Country Health Expenditure (PPP) vs Country GDP (PPP)') +
  xlab('GDP (Millions, PPP)') +
  ylab('CHE (Dollars, PPP)') + 
  scale_x_continuous(labels = comma) +
  theme(plot.title = element_text(hjust = 0.5))

MMR vs CHE

We do see a pattern where higher reported health expenditures correlate with lower maternal mortality and inversely, those countries (and years) with lower CHE see higher MMR. The inflection point visually starts at ~$1000, but rises most steeply at ~$500, under which the MMR spikes. Countries spending less than $1000 appear to have higher rates of MMR.

ggplot(mmr_df, aes(CHE, MMR)) +
  geom_point() +
  labs(title='Maternal Mortality Rate (MMR) vs Country Health Expenditure (PPP)') +
  xlab('CHE (Dollars, PPP)') +
  ylab('MMR (per 100k live births)') +
  theme(plot.title = element_text(hjust = 0.5)) + 
  stat_smooth()

Maternal Deaths vs CHE

Since MMR is a rate, let’s look at the count of deaths compared with health expenditures. The death count increases most dramatically with lower health expenditures and population impacts. There are a handful of countries with a high number of deaths per year and those correlate with the lowest outlay of health expenditures.

ggplot(mmr_df, aes(CHE, Deaths)) +
  geom_point() +
  labs(title='Maternal Deaths vs Country Health Expenditures (PPP)') +
  xlab('CHE (Dollars, PPP)') +
  ylab('Maternal Deaths') +
  theme(plot.title = element_text(hjust = 0.5))

If we filter out the outliers with huge numbers of deaths, we still visually see that the inflection point is ~$1000.

mmr_df %>% 
  filter(Deaths < 100000) %>%
  ggplot(aes(CHE, Deaths)) +
    geom_point() +
    labs(title='Maternal Deaths vs Country Health Expenditures (PPP)', subtitle='< 100,000 deaths') +
    xlab('CHE (Dollars, PPP)') +
    ylab('Maternal Deaths') +
    theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

MMR vs GDP

Generally the Maternal Mortality rate is higher with counties having lower GDP. However, there are plenty of countries with low GDP and low MMR, so clearly GDP doesn’t tell the whole story.

ggplot(mmr_df, aes(GDP, MMR)) +
  geom_point() +
  labs(title='Maternal Mortality Rate (MMR) vs Country GDP (Millions, PPP)') +
  xlab('GDP (Millions, PPP)') +
  ylab('MMR (per 100k live births)') +
  scale_x_continuous(labels = comma) +
  theme(plot.title = element_text(hjust = 0.5))

Skilled Attendants and MMR

While we only have a limited data set with Skilled Attendant at Birth percentages, the data we do have shows a linear trend. As the percentage of births managed by a health care provide increases, the MMR rate and number of deaths decrease. with lowered maternal mortality when there is a higher percentage of births managed by a health professional. That said, a Skilled Attendant at birth may only be able to do so much … there may be affects earlier in pregnancy leading to mortality. Notice that the linear trendline approximates the non-linear over most of the domain of SAB_Pct.

ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, MMR)) +
  geom_point() +
  labs(title = 'Maternal Mortality Rate vs Skilled Attendants at Birth (SAB)', 
       caption = 'Note: we only have 98 data points with both SAB and MMR',
       fill = 'Model Type') +
  xlab('Skilled Attendant at Birth (%)') +
  ylab('MMR (per 100k live births)') +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.0)) +
  stat_smooth(aes(colour = 'non-linear')) +
  stat_smooth(method=lm, 
              aes(colour = 'linear'))

Skilled Attendants and Deaths

As with mortality rate, we generally see fewer reported deaths as the percent of births are handled by a trained health professional. If we consider all data points, then we clealy have a non-linear relationship and a linear model would not be appropriate. However, the data approximates a linear model once the Skilled Attendants is over ~20%.

ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, Deaths)) +
  geom_point() + 
  scale_y_continuous(labels = comma) +
  ylim(0, 300000) +
  labs(fill = "Model Type", title="Maternal Deaths vs Skilled Attendants at Birth (SAB)") +
  xlab('Skilled Attendant at Birth (%)') +
  ylab('Maternal Deaths') +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  stat_smooth(method = gam, 
              formula = y ~ log(x), 
              aes(colour = 'non-linear')) +
  stat_smooth(method=lm, 
              aes(colour = 'linear'))


ggplot(mmr_df %>% drop_na() %>% filter(SAB_Pct > 20), 
       aes(SAB_Pct, Deaths)) +
  geom_point() + 
  scale_y_continuous(labels = comma) +
  ylim(0, 300000) +
  labs(title='Maternal Deaths vs Skilled Attendants at Birth (SAB)',
       subtitle = "> 20% Skilled Attendants",
       fill = 'Model Type') +
  xlab('Skilled Attendant at Birth (%)') +
  ylab('Maternal Deaths') +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  stat_smooth(method = gam, 
              formula = y ~ log(x), 
              aes(colour = 'non-linear')) +
  stat_smooth(method=lm, 
              aes(colour = 'linear'))

Part 4 - Inference

Let’s build simple models with the known features that helps to explain MMR and explore which features might have more influence on outcomes. This type of analysis would be useful for policy makers to design followup experiments that attempt to reduce MMR.I’ll start by exploring each feature separately, then combine and do multiple regression.

Population Model

Population is highly skewed which isn’t ideal for regression analysis. For linear regressions, we ideally want a normally distributed data set - when we log transform Population, the distribution is approximately normal with some bumps on the far right. We see this in the Q-Q plot as well. We will go ahead with a linear model and explore how much Population influences MMR over most of the data points. Interestingly, Population is a significant factor and as Population increases, MMR increases. However, with an adjusted R$^2$ of 0.008236, Population explains less than 1% of the variability in the model and we see that the residuals are not normally distributed suggesting that a linear model with Population probably isn’t appropriate. Given this, we won’t consider Population in our multiple regression model.

pop <- mmr_df$Population[!is.na(mmr_df$Population)]

hist(pop, breaks = 25)

logNormalPlot(log(pop))


# Run Linear Regression using Population to predict DeathRate
pop_lm <- lm(MMR ~ log(Population), data = mmr_df, rm.na = TRUE)

(pop_lm <- summary(pop_lm))
## 
## Call:
## lm(formula = MMR ~ log(Population), data = mmr_df, rm.na = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -281.23 -199.69 -141.84   76.82 2273.08 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -61.757    116.009  -0.532    0.595  
## log(Population)   17.731      7.336   2.417    0.016 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 323.4 on 582 degrees of freedom
##   (156 observations deleted due to missingness)
## Multiple R-squared:  0.009937,   Adjusted R-squared:  0.008236 
## F-statistic: 5.841 on 1 and 582 DF,  p-value: 0.01596

residPlot(pop_lm)

Births

As with Population, Births are also highly skewed. We will again use a log transformation as we explore the relationship of Births and MMR. log(Births) is highly significant and as Births increase, MMR increases and in this simple model we see an adjusted R$^2$ of 0.08013. On it’s own, Births explains < 10% of variability we see in MMR. Of concern is that our residuals are not normally distributed as seen in the Q-Q plot. If we want to try using Births we would need to introduce a non-linear regression. We will not include Births in any combined models.

bir <- mmr_df$Births[!is.na(mmr_df$Births)]
hist(bir)


logNormalPlot(log(bir))

# Run Linear Regression using Population to predict DeathRate
birth_lm <- lm(MMR ~ log(Births), data = mmr_df, rm.na = TRUE)

(birth_lm <- summary(birth_lm))
## 
## Call:
## lm(formula = MMR ~ log(Births), data = mmr_df, rm.na = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -423.00 -186.05 -109.10   73.43 2252.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -362.742     81.551  -4.448 1.04e-05 ***
## log(Births)   49.084      6.821   7.196 1.91e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 311.4 on 582 degrees of freedom
##   (156 observations deleted due to missingness)
## Multiple R-squared:  0.08171,    Adjusted R-squared:  0.08013 
## F-statistic: 51.79 on 1 and 582 DF,  p-value: 1.915e-12

residPlot(birth_lm)

Skilled Attendants

In our exploratory analysis, SAB_Pct seemed to be highly negatively correlated with MMR. We find that SAB_Pct is highly significant and explains ~53% of variability in MMR. We should note that we had limited data for SAB_Pct, but having skilled attendants at birth appears to be an important feature related to lowered MMR. In reality, a linear model is probably not the best choice since the data is not normally distributed and we see a spike ~100%. That said, the data is close to a uniform distribution. I’ll use a linear model with the caveat that its an approximation. We note that the residuals have a few bumps in the extremes, but are closer to normal. The Q-Q plot is somewhat linear through the central region. If we were looking to make a predictive model, we might explore removing outliers that are affecting the extremes or using a non-linear model (e.g. GAM). For a simple analysis to answer whether SAB_Pct is an important feature, this is probably sufficient.

hist(mmr_df$SAB_Pct, breaks = 50)


# Run Linear Regression using Population to predict DeathRate
sab_lm <- lm(MMR ~ SAB_Pct, data = mmr_df)

(sab_lm <- summary(sab_lm))
## 
## Call:
## lm(formula = MMR ~ SAB_Pct, data = mmr_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -584.43 -126.32  -28.24   91.64 1716.52 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1276.450     85.427   14.94   <2e-16 ***
## SAB_Pct      -12.296      1.179  -10.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 310.4 on 96 degrees of freedom
##   (642 observations deleted due to missingness)
## Multiple R-squared:  0.5312, Adjusted R-squared:  0.5263 
## F-statistic: 108.8 on 1 and 96 DF,  p-value: < 2.2e-16

residPlot(sab_lm)

Health Expenditures

Since CHE is highly skewed, we will use the log transform of CHE in our model. The log transformed values are approximately normally distributed. We find that in a simple model, log(CHE) is highly significant and as health expenditures increase, MMR decreases. The model showed an adjusted R$^2$ of 0.4366 meaning that CHE on it’s own accounts for ~44% of the variability we see in MMR by country. In our residuals, we do see some skew and bumps in the upper regions suggesting some outliers. However, the distribution is mostly normal and the Q-Q plot linear through most of the residual points.

che <- mmr_df$CHE[!is.na(mmr_df$CHE)]
hist(che)

logNormalPlot(log(che))


# Run Linear Regression using Population to predict DeathRate
che_lm <- lm(MMR ~ log(CHE), data = mmr_df)

(che_lm <- summary(che_lm))
## 
## Call:
## lm(formula = MMR ~ log(CHE), data = mmr_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.18 -138.63  -30.99   80.83 2078.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1161.193     53.900   21.54   <2e-16 ***
## log(CHE)    -158.228      8.788  -18.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 240 on 416 degrees of freedom
##   (322 observations deleted due to missingness)
## Multiple R-squared:  0.438,  Adjusted R-squared:  0.4366 
## F-statistic: 324.2 on 1 and 416 DF,  p-value: < 2.2e-16

residPlot(che_lm)

GDP

GDP is highly skewed, so we will use the log transform. The log transformation yielded values that are close to normally distributed. GDP is a significant factor, but only explains about ~9% of the variability we see in MMR. As a country’s GDP increases, we do see a decrease in MMR. However, the residual plots suggest that a linear treatment might not be appropriate for GDP if we were looking for a true predictive model. For a simple yes/no as to whether GDP is a significant factor, we can probably say yes, but that it doesn’t account for much of the variability.

gdp <- mmr_df$GDP[!is.na(mmr_df$GDP)]
hist(gdp)

logNormalPlot(log(gdp))


# Run Linear Regression using Population to predict DeathRate
gdp_lm <- lm(MMR ~ log(GDP), data = mmr_df)

(gdp_lm <- summary(gdp_lm))
## 
## Call:
## lm(formula = MMR ~ log(GDP), data = mmr_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -366.5 -176.5 -103.9   97.9 2159.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  712.000     78.320   9.091  < 2e-16 ***
## log(GDP)     -46.197      7.129  -6.480 2.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 304.8 on 417 degrees of freedom
##   (321 observations deleted due to missingness)
## Multiple R-squared:  0.09149,    Adjusted R-squared:  0.08931 
## F-statistic: 41.99 on 1 and 417 DF,  p-value: 2.59e-10

residPlot(gdp_lm)

Combined Linear Model

Let’s perform a multiple regression combining the features that independently showed significance and seemed appropriate for a linear model: SAB_Pct and CHE. In this model, we explain ~55% of the variability and the model as a whole is very significant with a p-value close to 0. However, we only find SAB_Pct to be significant. We had previously noted that health expenditures (CHE) and Skilled attendants at births (SAB_PCT) were highly auto-correlated. As such, the model has arbitrarily attributed the effects most strongly with SAB_Pct and has treated CHE as not significant. This is a limitation of simple linear regressions.

# Run Linear Regression using Population to predict DeathRate
combined_lm <- lm(MMR ~ SAB_Pct + log(CHE), data = mmr_df)

(combined_lm <- summary(combined_lm))
## 
## Call:
## lm(formula = MMR ~ SAB_Pct + log(CHE), data = mmr_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -575.23 -122.94  -17.92   74.31 1677.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1240.050    240.568   5.155 2.43e-06 ***
## SAB_Pct      -13.819      2.375  -5.818 1.83e-07 ***
## log(CHE)      28.955     69.185   0.419    0.677    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 324.5 on 67 degrees of freedom
##   (670 observations deleted due to missingness)
## Multiple R-squared:  0.5584, Adjusted R-squared:  0.5452 
## F-statistic: 42.37 on 2 and 67 DF,  p-value: 1.281e-12

residPlot(combined_lm)

GAM

As an exploratory effort, let’s try fitting a Generalized Additive Model to the features to see whether we can tease out the influence of CHE vs SAB_Pct. Unfortunately, we again are not able to differentiate the effect of SAB_Pct from CHE, there is just too much overlap between their influences.

# Build the model
model <- gam(MMR ~ s(SAB_Pct) + s(log(CHE)), data = mmr_df)

summary(model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## MMR ~ s(SAB_Pct) + s(log(CHE))
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   447.16      38.69   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df      F  p-value    
## s(SAB_Pct)  1.000  1.000 34.151 1.21e-07 ***
## s(log(CHE)) 1.265  1.488  0.116    0.741    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.548   Deviance explained = 56.2%
## GCV = 1.0989e+05  Scale est. = 1.0476e+05  n = 70

ggplot(mmr_df, aes(SAB_Pct, MMR) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))


ggplot(mmr_df, aes(log(CHE), MMR) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))


ggplot(mmr_df, aes(log(CHE), SAB_Pct) ) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))

Part 5 - Conclusion

Of the features explored (Population, GDP, Births, Health Expenditures and Skilled Attendant at Birth), I found that Health Expenditures and Skilled Attendant at Birth have the highest correlation with reduced Maternal Mortality Rates at child birth. These features were both signification and together explained ~55% of the variability we see in MMR across countries.

Of note, Skilled Attendants explained more of the variability in MMR (higher R$^2$) than CHE; however, we had very few datapoints with Skilled Attendants. On the other hand, CHE explained less (lower R$^2$), but we had many more data points. Given this, it would be worthwhile to either track down additional data for Skilled Attendants, find a proxy measure for Skilled Attendants or design followup studies/surverys to gather this data. Because we have so few data points for Skilled Attendants we cannot be sure it is a random sample – possibly the data set is biased where Skilled Attendants are reported in countries that have better outcomes. The fact that Skilled Attendants had the highest explanatory value, there is something going on that probably justifies further investigation.

While we cannot attribute causality, modeling would suggest these are areas for further experimental studies. To a less extent, we saw the increased Births had a positive correlation with MMR. It would be worthwhile to explore other data sources or do followup analysis to understand why more births might be associated with higher death rates.