Source files: https://github.com/djlofland/DATA606_F2019/tree/master/FinalProject
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, warning = FALSE)
library(tidyverse)
library(formattable) # Pretty print tables
library(scales)
library(caret)
library(corrr)
library(Hmisc)
library(corrplot)
library(glmnet)
library(choroplethr)
library(maps)
library(mgcv)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
par(mfrow = c(1, 2))
hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
xlab = "Residuals",col="lightgreen")
lines(density(model[["residuals"]], kernel = "ep"), col="blue", lwd=3)
curve(dnorm(x,
mean=mean(model[["residuals"]]),
sd=sd(model[["residuals"]])),
col="red", lwd=3, lty="dotted", add=T)
qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
par(mfrow = c(1, 1))
}
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
logNormalPlot <- function(data) {
par(mfrow = c(1, 2))
hist(data, freq = FALSE, breaks = "fd", main = "Log Transformed",
xlab = "Log Values", col="lightgreen")
lines(density(data, kernel = "ep"), col="blue", lwd=3)
curve(dnorm(x, mean=mean(data), sd=sd(data)),
col="red", lwd=3, lty="dotted", add=T)
qqnorm(data, main = "Log Transformed Q-Q plot")
qqline(data, col="red", lwd=3, lty="dotted")
par(mfrow = c(1, 1))
}In the late 1990’s I worked for an NGO that focused on training doctors, nurses and midwives on women reproductive health issues and health during pregnancy. The NGO developed and tested training programs in very rural areas around the world to teach health care providers how to provide care for women with no easy access to primary health care facilities or doctors.
UNICEF provides data sets with data and trends in both maternal health and outcomes along with indicators by country.
For an understanding of the problem space, see Maternal mortality.
Some key facts they provide:
I will consider Maternal Mortality Rate by country through the features: GDP, Population, Births, Health Spending, and Skilled Attendants during Birth. While factors like GDP and Population are largely out of the control of individual countries, Birthrate, Health Spending, and Skilled Staff at Birth are levers a country could influence through policies and education if these were found to correlate with Maternal Mortality Rates. Note this is an observational study looking for correlations, not causation. Features found to have significant correlation would be candidates for follow-up experimental studies.
The primary data sets I’ll explore are:
Each of these data sets required Tidy processing to clean up. I also join the data where possible by country and year to tease out additional patterns not available within any one data set.
To facilitate joining tables using different Country label conventions, I scraped a look-up table with country names and ISO3 codes which I join to my various data sets.
These data sets explore different years, but have broad overlap between 2000-2015 (every 5 years). My analysis will be somewhat limited by the available data. If I were doing a deeper dive, I would look for more complete data and/or limit my questions to those countries that have a richer set of data. My suspicion is that better country data and reporting may indicate more attention to tackling problems with maternal mortality and consequently, better data might come from countries with better reporting and more on-the-ground efforts. In cases where we have null values (typically introduced when joining different data sets), I apply a drop_na() and limit analysis to available data.
All data sets are downloaded from primary internet sources and cached locally to ease repeat analysis.
cache_folder <- './data/cache/' # path where we cache csv data files
raw_folder <- './data/raw/' # path wher we cache raw data files we downloadDuring initial research to locate data sets, I found that different data sets included ISO2 (Alpha 2), ISO3 (Alpha 3) and/or descriptive country names. Just in case it’s needed, we need a look-up table to join data sets using different country name conventions.
url <- "https://www.iban.com/country-codes"
cache_fn <- 'country_codes.csv'
isCacheFound <- FALSE
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="myTable"]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
iso3_codes <- data_df
# Convert year columns into a variable
iso3_codes <- iso3_codes %>%
mutate(ISO3 = `Alpha-3 code`, ISO2 = `Alpha-2 code`) %>%
select(Country, ISO3, ISO2)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes. Tidy data, first 10 rows shown. (Source: http://iban.com/country-codes)")| Country | ISO3 | ISO2 |
|---|---|---|
| Afghanistan | AFG | AF |
| Åland Islands | ALA | AX |
| Albania | ALB | AL |
| Algeria | DZA | DZ |
| American Samoa | ASM | AS |
| Andorra | AND | AD |
| Angola | AGO | AO |
| Anguilla | AIA | AI |
| Antarctica | ATA | AQ |
| Antigua and Barbuda | ATG | AG |
Changes in country population might change how we interpret maternal mortality rate. While the rate should theoretically be independent from population (by definition MMR is maternal deaths per 100,000 live births), if there is a large increase or decrease in population, this might impact conditions around pregnancy (nutrition, access to health care, etc) and/or conditions at time of birth (e.g. access to health care). Since MMR is “deaths per live births”, I’ll end up needing population data to help calculate the number of live births.
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population"
cache_fn <- 'population.csv'
isCacheFound <- FALSE
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
population_df <- data_df
# Convert year columns into a variable
population_df <- population_df %>%
mutate(Country = `Country (or dependent territory)`) %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, Population, -Country)
# Fix datatypes
population_df$Year <- as.numeric(population_df$Year)
# Filter years
population_df <- population_df %>% filter(Year >= 2000 & Year <= 2015) %>% arrange(Country, Year)
population_df$Population <- population_df$Population* 1000
population_df <- population_df %>% left_join(iso3_codes)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data. Tidy data, first 10 rows shown. (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)")| Country | Year | Population | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 22462000 | AFG | AF |
| Afghanistan | 2005 | 26335000 | AFG | AF |
| Afghanistan | 2010 | 29121000 | AFG | AF |
| Afghanistan | 2015 | 32565000 | AFG | AF |
| Albania | 2000 | 3159000 | ALB | AL |
| Albania | 2005 | 3025000 | ALB | AL |
| Albania | 2010 | 2987000 | ALB | AL |
| Albania | 2015 | 3030000 | ALB | AL |
| Algeria | 2000 | 30639000 | DZA | DZ |
| Algeria | 2005 | 32918000 | DZA | DZ |
While there is variation between countries, the overall birthrate worldwide has declined. To understand changes in maternal mortality, we might need to estimate how many births are occurring during years of interest in each country. It’s possible that changes in birth rates could have an impact on MMR.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=excel'
raw_fn <- 'API_SP.DYN.CBRT.IN_DS2_en_excel_v2_248743.xlsx'
cache_fn <- 'births.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 3 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country', 'ISO3', 'Indicator', 'Indicator_Code', as.character(seq(1960, 2017, 1)))
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the data set
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/births.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
birthrate_df <- data_df
# Convert year columns into a variable
birthrate_df <- birthrate_df %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, BirthRate, -Country) %>%
arrange(Country, Year) %>%
inner_join(iso3_codes)
# Fix datatypes
birthrate_df$Year <- as.numeric(birthrate_df$Year)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data. Tidy data, first 10 rows shown. Births per 1000 people. (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)")| Country | Year | BirthRate | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 48.021 | AFG | AF |
| Afghanistan | 2005 | 44.723 | AFG | AF |
| Afghanistan | 2010 | 39.829 | AFG | AF |
| Afghanistan | 2015 | 34.809 | AFG | AF |
| Albania | 2000 | 16.436 | ALB | AL |
| Albania | 2005 | 12.821 | ALB | AL |
| Albania | 2010 | 12.001 | ALB | AL |
| Albania | 2015 | 12.197 | ALB | AL |
| Algeria | 2000 | 19.554 | DZA | DZ |
| Algeria | 2005 | 20.774 | DZA | DZ |
This is the primary data set to which everything else will be linked. Here we see the maternal mortality rate provided as deaths per 100,000 live births broken out by country and year.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2015/11/MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
raw_fn <- 'MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
cache_fn <- 'mmr_data.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 4 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 17 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','2000','2005','2010','2015', '2017')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the data set
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/mmr_data.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
mmr_df <- data_df
# Convert year columns into a variable
mmr_df <- mmr_df %>%
gather(Year, MMR, -ISO3, -Country)
# Fix datatypes
mmr_df$Year <- as.numeric(mmr_df$Year)
# Filter years
mmr_df <- mmr_df %>% filter(Year <= 2015) %>% arrange(Country, Year)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015. (Source: https://data.unicef.org/resources/data set/maternal-health-data/)")| ISO3 | Country | Year | MMR |
|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 |
| AFG | Afghanistan | 2005 | 1140 |
| AFG | Afghanistan | 2010 | 954 |
| AFG | Afghanistan | 2015 | 701 |
| ALB | Albania | 2000 | 23 |
| ALB | Albania | 2005 | 22 |
| ALB | Albania | 2010 | 21 |
| ALB | Albania | 2015 | 15 |
| DZA | Algeria | 2000 | 161 |
| DZA | Algeria | 2005 | 127 |
UNICEF provides country level data describing a number of different indicators related to pregnancy and births. The downloaded raw data set (Excel file) includes a number of different indicator tabs, “Skilled Attendant at Birth” (SAB) being only one of many. We may be interested in how the presence of a trained attendant might correlate with the more central question of maternal mortality during birth. UNICEF don’t report for all countries, only a subset, so that may limit analysis to specific countries.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
cache_fn <- 'sab_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 8 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 9 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 5 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','Year','Total','Age_1517','Age_1819', 'Age_lt20', 'Age_gt20', 'Age_2034', 'Age_3549', 'Source', 'SourceYear')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the data set
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/sab_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
sab_df <- data_df
# While this is a wide data set and it's trivial to gather() the age data, for my questions, I really don't need age. Since my MMR, Population and NHA data do not provide Age breakouts, I will only focus on the Total percentage.
sab_df <- sab_df %>% select(ISO3, Country, Year, Total, Source)
# Fix datatypes
sab_df$Year <- as.numeric(sab_df$Year)
sab_df$Total <- as.numeric(sab_df$Total)
# Since there are two different sources of data, we might have a case where they report for the same year. I'm going to group_by() and average the totals in the case where we have multiple data points.
sab_df <- sab_df %>%
group_by(ISO3, Year) %>%
dplyr::summarize(SAB_Pct = mean(Total, na.rm = TRUE))
sab_df <- sab_df %>%
filter(Year == 2000 | Year == 2005 | Year == 2010 | Year == 2015)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth. Tidy data, first 10 rows shown. (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)")| ISO3 | Year | SAB_Pct |
|---|---|---|
| AFG | 2010 | 39.00000 |
| ALB | 2000 | 99.12912 |
| ALB | 2005 | 99.75600 |
| ARM | 2000 | 97.17686 |
| ARM | 2005 | 98.58195 |
| ARM | 2010 | 100.00000 |
| AZE | 2000 | 87.46867 |
| BDI | 2000 | 25.17007 |
| BDI | 2005 | 33.62995 |
| BDI | 2010 | 67.18792 |
WHO provides on online tool to export data from WHO member countries on a variety of measures. After exploring options, I honed in on Health Expenditures. We may be curious whether spending on Health has any noticeable correlation with maternal mortality outcomes. I manually downloaded the NHA_indicators.xlsx from their online tool.
reference: http://apps.who.int/nha/database/Home/Index/en source: http://apps.who.int/nha/database/ViewData/Indicators/en
# reference: http://apps.who.int/nha/database/Home/Index/en
# source: http://apps.who.int/nha/database/ViewData/Indicators/en
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
#source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'NHA_indicators.xlsx'
cache_fn <- 'nha_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 2 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country','Indicator','Note','Year_2000','Year_2005','Year_2010', 'Year_2015')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Since data came from an online tool, we don't have a specific file we can download
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
print(msg)
## [1] "Found cached copy: ./data/cache/nha_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
nha_df <- data_df
print(msg)
## [1] "Cached CSV data loaded."
# Remove to extra Note column we won't need
nha_df <- nha_df %>% select(-Note)
# Convert year columns into a variable
nha_df <- nha_df %>%
gather(Year, Amount, -Country, -Indicator) %>%
mutate(Year = stringr::str_replace(Year, "Year_", "")) %>%
drop_na()
# Fix datatypes
nha_df$Year <- as.numeric(nha_df$Year)
nha_df$Amount <- as.numeric(nha_df$Amount)
# Break out CHE into a separate column
exp_data <- nha_df %>% filter(Indicator == 'Current Health Expenditure (CHE) per Capita in PPP') %>%
mutate(CHE = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Break out GDP into a separate column
gdp_data <- nha_df %>% filter(Indicator == 'Gross Domestic Product') %>%
mutate(GDP = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Join the columns back into a table based on Country & Year
nha_df <- gdp_data %>%
left_join(exp_data) %>%
left_join(iso3_codes)knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Reported Health Expenditures & GDP by Country & Year. Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country. This is also sometimes referred to as the 'Big Mac Index' (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)")| Country | Year | GDP | CHE | ISO3 | ISO2 |
|---|---|---|---|---|---|
| Afghanistan | 2005 | 24851.345 | 98.61208 | AFG | AF |
| Afghanistan | 2010 | 44457.152 | 132.27128 | AFG | AF |
| Albania | 2000 | 11925.958 | 258.28327 | ALB | AL |
| Albania | 2005 | 17663.350 | 363.84904 | ALB | AL |
| Albania | 2010 | 28073.799 | 478.31474 | ALB | AL |
| Algeria | 2000 | 252385.066 | 282.38500 | DZA | DZ |
| Algeria | 2005 | 365236.389 | 354.95766 | DZA | DZ |
| Algeria | 2010 | 455452.366 | 645.28783 | DZA | DZ |
| Andorra | 2000 | 2132.781 | 3049.00618 | AND | AD |
| Andorra | 2005 | 3444.711 | 3741.40393 | AND | AD |
Note: PPP refers to “Price Point Parity”. This country-by-country adjustment reflects different purchasing power in different geo’s and attempts to normalize spending for better comparison across geo’s. We know intuitively that cost of living and items vary greatly around the world and $1 in the US doesn’t NOT have the same purchasing power as $1 in India. PPP normalizes so that spending can be compared across countries.
Note that we are combining data from a number of sources and years. As such there will be a number of NA’s where data is missing in various columns. Here I also calculate the actual number of maternal deaths based on populations, birth rates and MMR rates.
# Use the current population by year * birthrate by year to get actual births
births_df <- birthrate_df %>%
inner_join(population_df) %>%
mutate(Births = round(Population / 1000 * BirthRate)) %>%
select(ISO3, Country, Year, Population, BirthRate, Births)
births_df <- births_df %>% left_join(iso3_codes)
# now that we have births, we can multiply by MMR (deaths per 100,000 births)
mmr_df <- mmr_df %>%
left_join(births_df) %>%
mutate(Deaths = round(Births / 10000 * MMR))
# join in Skilled Attendant at Birth (SAB)
mmr_df <- mmr_df %>%
left_join(sab_df) %>%
left_join(nha_df)knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality table with supporting columns.")| ISO3 | Country | Year | MMR | Population | BirthRate | Births | ISO2 | Deaths | SAB_Pct | GDP | CHE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 | 22462000 | 48.021 | 1078648 | AF | 156404 | NA | NA | NA |
| AFG | Afghanistan | 2005 | 1140 | 26335000 | 44.723 | 1177780 | AF | 134267 | NA | 24851.34 | 98.61208 |
| AFG | Afghanistan | 2010 | 954 | 29121000 | 39.829 | 1159860 | AF | 110651 | 39.00000 | 44457.15 | 132.27128 |
| AFG | Afghanistan | 2015 | 701 | 32565000 | 34.809 | 1133555 | AF | 79462 | NA | NA | NA |
| ALB | Albania | 2000 | 23 | 3159000 | 16.436 | 51921 | AL | 119 | 99.12912 | 11925.96 | 258.28327 |
| ALB | Albania | 2005 | 22 | 3025000 | 12.821 | 38784 | AL | 85 | 99.75600 | 17663.35 | 363.84904 |
| ALB | Albania | 2010 | 21 | 2987000 | 12.001 | 35847 | AL | 75 | NA | 28073.80 | 478.31474 |
| ALB | Albania | 2015 | 15 | 3030000 | 12.197 | 36957 | AL | 55 | NA | NA | NA |
| DZA | Algeria | 2000 | 161 | 30639000 | 19.554 | 599115 | DZ | 9646 | NA | 252385.07 | 282.38500 |
| DZA | Algeria | 2005 | 127 | 32918000 | 20.774 | 683839 | DZ | 8685 | NA | 365236.39 | 354.95766 |
We generally see a decline in mortality rates from 2000 to 2015 data. We observe in both the histogram and density plots that the weight is shifted towards lower MMR in 2015 vs 2000.
data <- mmr_df %>% filter(Year==2000 | Year==2015) %>% select(MMR, Year)
data$Year <- as.factor(data$Year)
ggplot(data=data, aes(x=MMR, color=Year)) +
geom_histogram(fill='white', alpha=0.2, position="identity") +
scale_color_discrete(name="Year", labels=c(2000, 2015)) + labs(title = "2000 and 2015 Maternal Mortality Rates (MMR)") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data=data, aes(x=MMR, color=Year)) +
geom_density(fill='white', alpha=0.2, position="identity") +
scale_color_discrete(name="Year", labels=c(2000, 2015)) + labs(title = "2000 and 2015 Maternal Mortality Rates (MMR)") +
theme(plot.title = element_text(hjust = 0.5))We can visualize MMR’s on a world map to get a sense for geographical distribution and where higher or lower MMR’s are more prevalent.
data <- mmr_df %>%
select(Country, MMR, Year) %>%
rename(region = Country,
value = MMR) %>%
mutate(region = tolower(region)) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data %>% filter(Year == 2000), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2000, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2005), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2005, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2010), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2010, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2015), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2015, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))Let’s see which countries have the most reduction in maternal deaths by taking 2015 data and subtracting 2000 data to find the delta for each country over the 15 years. This is a rough measure of overall trends over the 15 years.
# Group data by country and calculate the delta in deaths over the years
tbl <- mmr_df %>%
group_by(Country) %>%
dplyr::summarize(delta = round(first(Deaths) - last(Deaths))) %>%
arrange(desc(delta)) %>%
drop_na()
# barplot
ggplot(tbl %>% top_n(10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Top 10 Countries with Reductions in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.1, vjust = 0.5 ) +
#scale_y_discrete(labels = comma, breaks=c(1000, 100000, 200000, 600000)) +
ylim(c(0,700000)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip() +
scale_y_continuous(labels = comma)… and those with increases in maternal deaths. Something profound has happened in India over these years and Idia tops the list with the largest decrease in deaths. At this point, we don’t know why, but that is a very positive trend. Nigeria on the other hand is clearly experiencing a crisis.
# barplot
ggplot(tbl %>% top_n(-10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Bottom 10 Countries with Increases in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.5, vjust = 0.5 ) +
scale_y_discrete(labels = comma, breaks=c(-30000, -20000, -1000)) +
ylim(c(-25000,0)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip()data <- tbl %>%
filter(delta > 0) %>%
rename(region = Country,
value = delta) %>%
mutate(region = tolower(region),
value = value) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data, num_colors = 9) +
scale_fill_brewer(palette="Greens") +
labs(title = paste("MMR Improvements (2000-2015)"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))data <- tbl %>%
filter(delta < 0) %>%
rename(region = Country,
value = delta) %>%
mutate(region = tolower(region),
value = -value) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data, num_colors = 9) +
scale_fill_brewer(palette="Reds") +
labs(title = paste("MMR Losses (2000-2015)"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))Let’s start with a simple correlation matrix to understand which features correlate positively with MMR (i.e. increase dates of death) or negatively (decreased rates of death). Note that we are particularly interested in features in RED that negatively correlate with MMR. We see that Country Health Expenditures (CHE), Skilled Attendant at Birth (SAB_Pct) and GDP appear as promising features to explore. Births have a positive correlation – more births correlating with a higher rate of moms dying. Since MMR is a rate (deaths per 100k births), we might have assumed that the number of births would have no direct affect on the rate of women dying. Since we cannot attribute causation, we can merely say that in countries with more births, there is an increase in the rate of deaths. Maybe more births tax limited resources? This might be a different research question to explore in a later study. Interestingly, Population doesn’t seem to correlate with MMR. So higher population doesn’t seem to be a factor in explaining MMR.
mmr_data <- mmr_df %>%
select(Population, Births, GDP, CHE, MMR, SAB_Pct) %>%
drop_na()
mmr_cor <- cor(mmr_data, method = c("spearman"))
corrplot(mmr_cor)
palette = colorRampPalette(c("red", "white", "green")) (20)
heatmap(x = mmr_cor, col = palette, symm = TRUE)We see that generally, countries spend less than $4000 (PPP) adjusted dollars per person on health expenditures and as GDP increases, there is an increase in CHE. We see some clear outliers with very high GDP countries spending very little and vice versa, countries with relatively low GDP spending reporting quite high CHE. While this is an interesting pattern, I suspect that CHE is not entirely reported nor being spent on the poorest population that might be more impacted by MMR. Unfortunately, I don’t have the data set to fully explore those questions.
ggplot(mmr_df, aes(GDP, CHE)) +
geom_point() +
labs(title='Country Health Expenditure (PPP) vs Country GDP (PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('CHE (Dollars, PPP)') +
scale_x_continuous(labels = comma) +
theme(plot.title = element_text(hjust = 0.5))We do see a pattern where higher reported health expenditures correlate with lower maternal mortality and inversely, those countries (and years) with lower CHE see higher MMR. The inflection point visually starts at ~$1000, but rises most steeply at ~$500, under which the MMR spikes. Countries spending less than $1000 appear to have higher rates of MMR.
ggplot(mmr_df, aes(CHE, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country Health Expenditure (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('MMR (per 100k live births)') +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth()Since MMR is a rate, let’s look at the count of deaths compared with health expenditures. The death count increases most dramatically with lower health expenditures and population impacts. There are a handful of countries with a high number of deaths per year and those correlate with the lowest outlay of health expenditures.
ggplot(mmr_df, aes(CHE, Deaths)) +
geom_point() +
labs(title='Maternal Deaths vs Country Health Expenditures (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5))If we filter out the outliers with huge numbers of deaths, we still visually see that the inflection point is ~$1000.
mmr_df %>%
filter(Deaths < 100000) %>%
ggplot(aes(CHE, Deaths)) +
geom_point() +
labs(title='Maternal Deaths vs Country Health Expenditures (PPP)', subtitle='< 100,000 deaths') +
xlab('CHE (Dollars, PPP)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))Generally the Maternal Mortality rate is higher with counties having lower GDP. However, there are plenty of countries with low GDP and low MMR, so clearly GDP doesn’t tell the whole story.
ggplot(mmr_df, aes(GDP, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country GDP (Millions, PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('MMR (per 100k live births)') +
scale_x_continuous(labels = comma) +
theme(plot.title = element_text(hjust = 0.5))While we only have a limited data set with Skilled Attendant at Birth percentages, the data we do have shows a linear trend. As the percentage of births managed by a health care provide increases, the MMR rate and number of deaths decrease. with lowered maternal mortality when there is a higher percentage of births managed by a health professional. That said, a Skilled Attendant at birth may only be able to do so much … there may be affects earlier in pregnancy leading to mortality. Notice that the linear trendline approximates the non-linear over most of the domain of SAB_Pct.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, MMR)) +
geom_point() +
labs(title = 'Maternal Mortality Rate vs Skilled Attendants at Birth (SAB)',
caption = 'Note: we only have 98 data points with both SAB and MMR',
fill = 'Model Type') +
xlab('Skilled Attendant at Birth (%)') +
ylab('MMR (per 100k live births)') +
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.0)) +
stat_smooth(aes(colour = 'non-linear')) +
stat_smooth(method=lm,
aes(colour = 'linear'))As with mortality rate, we generally see fewer reported deaths as the percent of births are handled by a trained health professional. If we consider all data points, then we clealy have a non-linear relationship and a linear model would not be appropriate. However, the data approximates a linear model once the Skilled Attendants is over ~20%.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, Deaths)) +
geom_point() +
scale_y_continuous(labels = comma) +
ylim(0, 300000) +
labs(fill = "Model Type", title="Maternal Deaths vs Skilled Attendants at Birth (SAB)") +
xlab('Skilled Attendant at Birth (%)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
stat_smooth(method = gam,
formula = y ~ log(x),
aes(colour = 'non-linear')) +
stat_smooth(method=lm,
aes(colour = 'linear'))
ggplot(mmr_df %>% drop_na() %>% filter(SAB_Pct > 20),
aes(SAB_Pct, Deaths)) +
geom_point() +
scale_y_continuous(labels = comma) +
ylim(0, 300000) +
labs(title='Maternal Deaths vs Skilled Attendants at Birth (SAB)',
subtitle = "> 20% Skilled Attendants",
fill = 'Model Type') +
xlab('Skilled Attendant at Birth (%)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
stat_smooth(method = gam,
formula = y ~ log(x),
aes(colour = 'non-linear')) +
stat_smooth(method=lm,
aes(colour = 'linear'))Let’s build simple models with the known features that helps to explain MMR and explore which features might have more influence on outcomes. This type of analysis would be useful for policy makers to design followup experiments that attempt to reduce MMR.I’ll start by exploring each feature separately, then combine and do multiple regression.
Population is highly skewed which isn’t ideal for regression analysis. For linear regressions, we ideally want a normally distributed data set - when we log transform Population, the distribution is approximately normal with some bumps on the far right. We see this in the Q-Q plot as well. We will go ahead with a linear model and explore how much Population influences MMR over most of the data points. Interestingly, Population is a significant factor and as Population increases, MMR increases. However, with an adjusted R\(^2\) of 0.008236, Population explains less than 1% of the variability in the model and we see that the residuals are not normally distributed suggesting that a linear model with Population probably isn’t appropriate. Given this, we won’t consider Population in our multiple regression model.
# Run Linear Regression using Population to predict DeathRate
pop_lm <- lm(MMR ~ log(Population), data = mmr_df, rm.na = TRUE)
(pop_lm <- summary(pop_lm))
##
## Call:
## lm(formula = MMR ~ log(Population), data = mmr_df, rm.na = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -281.23 -199.69 -141.84 76.82 2273.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.757 116.009 -0.532 0.595
## log(Population) 17.731 7.336 2.417 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 323.4 on 582 degrees of freedom
## (156 observations deleted due to missingness)
## Multiple R-squared: 0.009937, Adjusted R-squared: 0.008236
## F-statistic: 5.841 on 1 and 582 DF, p-value: 0.01596
residPlot(pop_lm)As with Population, Births are also highly skewed. We will again use a log transformation as we explore the relationship of Births and MMR. log(Births) is highly significant and as Births increase, MMR increases and in this simple model we see an adjusted R\(^2\) of 0.08013. On it’s own, Births explains < 10% of variability we see in MMR. Of concern is that our residuals are not normally distributed as seen in the Q-Q plot. If we want to try using Births we would need to introduce a non-linear regression. We will not include Births in any combined models.
# Run Linear Regression using Population to predict DeathRate
birth_lm <- lm(MMR ~ log(Births), data = mmr_df, rm.na = TRUE)
(birth_lm <- summary(birth_lm))
##
## Call:
## lm(formula = MMR ~ log(Births), data = mmr_df, rm.na = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -423.00 -186.05 -109.10 73.43 2252.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -362.742 81.551 -4.448 1.04e-05 ***
## log(Births) 49.084 6.821 7.196 1.91e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.4 on 582 degrees of freedom
## (156 observations deleted due to missingness)
## Multiple R-squared: 0.08171, Adjusted R-squared: 0.08013
## F-statistic: 51.79 on 1 and 582 DF, p-value: 1.915e-12
residPlot(birth_lm)In our exploratory analysis, SAB_Pct seemed to be highly negatively correlated with MMR. We find that SAB_Pct is highly significant and explains ~53% of variability in MMR. We should note that we had limited data for SAB_Pct, but having skilled attendants at birth appears to be an important feature related to lowered MMR. In reality, a linear model is probably not the best choice since the data is not normally distributed and we see a spike ~100%. That said, the data is close to a uniform distribution. I’ll use a linear model with the caveat that its an approximation. We note that the residuals have a few bumps in the extremes, but are closer to normal. The Q-Q plot is somewhat linear through the central region. If we were looking to make a predictive model, we might explore removing outliers that are affecting the extremes or using a non-linear model (e.g. GAM). For a simple analysis to answer whether SAB_Pct is an important feature, this is probably sufficient.
# Run Linear Regression using Population to predict DeathRate
sab_lm <- lm(MMR ~ SAB_Pct, data = mmr_df)
(sab_lm <- summary(sab_lm))
##
## Call:
## lm(formula = MMR ~ SAB_Pct, data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -584.43 -126.32 -28.24 91.64 1716.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1276.450 85.427 14.94 <2e-16 ***
## SAB_Pct -12.296 1.179 -10.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 310.4 on 96 degrees of freedom
## (642 observations deleted due to missingness)
## Multiple R-squared: 0.5312, Adjusted R-squared: 0.5263
## F-statistic: 108.8 on 1 and 96 DF, p-value: < 2.2e-16
residPlot(sab_lm)Since CHE is highly skewed, we will use the log transform of CHE in our model. The log transformed values are approximately normally distributed. We find that in a simple model, log(CHE) is highly significant and as health expenditures increase, MMR decreases. The model showed an adjusted R\(^2\) of 0.4366 meaning that CHE on it’s own accounts for ~44% of the variability we see in MMR by country. In our residuals, we do see some skew and bumps in the upper regions suggesting some outliers. However, the distribution is mostly normal and the Q-Q plot linear through most of the residual points.
# Run Linear Regression using Population to predict DeathRate
che_lm <- lm(MMR ~ log(CHE), data = mmr_df)
(che_lm <- summary(che_lm))
##
## Call:
## lm(formula = MMR ~ log(CHE), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.18 -138.63 -30.99 80.83 2078.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1161.193 53.900 21.54 <2e-16 ***
## log(CHE) -158.228 8.788 -18.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240 on 416 degrees of freedom
## (322 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.4366
## F-statistic: 324.2 on 1 and 416 DF, p-value: < 2.2e-16
residPlot(che_lm)GDP is highly skewed, so we will use the log transform. The log transformation yielded values that are close to normally distributed. GDP is a significant factor, but only explains about ~9% of the variability we see in MMR. As a country’s GDP increases, we do see a decrease in MMR. However, the residual plots suggest that a linear treatment might not be appropriate for GDP if we were looking for a true predictive model. For a simple yes/no as to whether GDP is a significant factor, we can probably say yes, but that it doesn’t account for much of the variability.
# Run Linear Regression using Population to predict DeathRate
gdp_lm <- lm(MMR ~ log(GDP), data = mmr_df)
(gdp_lm <- summary(gdp_lm))
##
## Call:
## lm(formula = MMR ~ log(GDP), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -366.5 -176.5 -103.9 97.9 2159.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 712.000 78.320 9.091 < 2e-16 ***
## log(GDP) -46.197 7.129 -6.480 2.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 304.8 on 417 degrees of freedom
## (321 observations deleted due to missingness)
## Multiple R-squared: 0.09149, Adjusted R-squared: 0.08931
## F-statistic: 41.99 on 1 and 417 DF, p-value: 2.59e-10
residPlot(gdp_lm)Let’s perform a multiple regression combining the features that independently showed significance and seemed appropriate for a linear model: SAB_Pct and CHE. In this model, we explain ~55% of the variability and the model as a whole is very significant with a p-value close to 0. However, we only find SAB_Pct to be significant. We had previously noted that health expenditures (CHE) and Skilled attendants at births (SAB_PCT) were highly auto-correlated. As such, the model has arbitrarily attributed the effects most strongly with SAB_Pct and has treated CHE as not significant. This is a limitation of simple linear regressions.
# Run Linear Regression using Population to predict DeathRate
combined_lm <- lm(MMR ~ SAB_Pct + log(CHE), data = mmr_df)
(combined_lm <- summary(combined_lm))
##
## Call:
## lm(formula = MMR ~ SAB_Pct + log(CHE), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -575.23 -122.94 -17.92 74.31 1677.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1240.050 240.568 5.155 2.43e-06 ***
## SAB_Pct -13.819 2.375 -5.818 1.83e-07 ***
## log(CHE) 28.955 69.185 0.419 0.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.5 on 67 degrees of freedom
## (670 observations deleted due to missingness)
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5452
## F-statistic: 42.37 on 2 and 67 DF, p-value: 1.281e-12
residPlot(combined_lm)As an exploratory effort, let’s try fitting a Generalized Additive Model to the features to see whether we can tease out the influence of CHE vs SAB_Pct. Unfortunately, we again are not able to differentiate the effect of SAB_Pct from CHE, there is just too much overlap between their influences.
# Build the model
model <- gam(MMR ~ s(SAB_Pct) + s(log(CHE)), data = mmr_df)
summary(model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## MMR ~ s(SAB_Pct) + s(log(CHE))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 447.16 38.69 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SAB_Pct) 1.000 1.000 34.151 1.21e-07 ***
## s(log(CHE)) 1.265 1.488 0.116 0.741
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.548 Deviance explained = 56.2%
## GCV = 1.0989e+05 Scale est. = 1.0476e+05 n = 70
ggplot(mmr_df, aes(SAB_Pct, MMR) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))
ggplot(mmr_df, aes(log(CHE), SAB_Pct) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))Of the features explored (Population, GDP, Births, Health Expenditures and Skilled Attendant at Birth), I found that Health Expenditures and Skilled Attendant at Birth have the highest correlation with reduced Maternal Mortality Rates at child birth. These features were both signification and together explained ~55% of the variability we see in MMR across countries.
Of note, Skilled Attendants explained more of the variability in MMR (higher R\(^2\)) than CHE; however, we had very few datapoints with Skilled Attendants. On the other hand, CHE explained less (lower R\(^2\)), but we had many more data points. Given this, it would be worthwhile to either track down additional data for Skilled Attendants, find a proxy measure for Skilled Attendants or design followup studies/surverys to gather this data. Because we have so few data points for Skilled Attendants we cannot be sure it is a random sample – possibly the data set is biased where Skilled Attendants are reported in countries that have better outcomes. The fact that Skilled Attendants had the highest explanatory value, there is something going on that probably justifies further investigation.
While we cannot attribute causality, modeling would suggest these are areas for further experimental studies. To a less extent, we saw the increased Births had a positive correlation with MMR. It would be worthwhile to explore other data sources or do followup analysis to understand why more births might be associated with higher death rates.