PH 251D Project 2

1. Introduction

For my project I will be using R to analyze ecologic study data on how the proportion of pregant women that undergo a cesarean delivery (c-section) varies across 137 counties with varying GDP levels. A recent study in the British Medical Journal found that cesarean delivery rates ranged between 0.6% in South Sudan and 58.9% in the Dominican Republic.(1) In the US, the rate is about 33%. Economic inequity may be a factor that explains lower c-section rates, as lower income populations may not have the same access to medical services that are able to provide c-sections as higher income populations. The GDPs of these countries may help explain some of the great variation observed in c-section rates across the globe.

2. Analysis (R Code) & Explanations

library(dplyr)
library(ggplot2)
library(readr)
library(broom)
library(tidyr)

Dictionary of Variables in the Data Set

Variable	Description
Country_Name	The name of the country
CountryCode	3-digit unique country identifier
Births_Per_1000	The number of births per 1000 people in the country
Income_Group	The income group the country is categorized into
Region	The geographic region the country belongs to
GDP_2006	The gross domestic product (GDP) of the country in 2006
CS_rate	The percent of births that were delivered by cesarean section
Income_Group_order	The “ordered” version of Income_Group

Cesarean_data <- read.csv("~/R_Workspace/Cesarean_data.csv")

Cesarean_data <- Cesarean_data %>%
mutate(Income_Group_order = forcats::fct_relevel(Income_Group, "Low income",
"Lower middle income", "Upper middle income",
"High income: nonOECD", "High income: OECD"))

First I re-orders the variable Income_Group in the specified order so that the data will be plotted in a logical way.

str(Cesarean_data)

## 'data.frame':    137 obs. of  8 variables:
##  $ Country_Name      : Factor w/ 137 levels "Albania","Algeria",..: 1 3 128 4 5 6 7 8 12 13 ...
##  $ CountryCode       : Factor w/ 137 levels "ALB","AND","ARE",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Births_Per_1000   : int  46 1 63 689 47 267 76 166 119 342 ...
##  $ Income_Group      : Factor w/ 5 levels "High income: nonOECD",..: 5 1 1 1 4 2 2 5 2 3 ...
##  $ Region            : Factor w/ 7 levels "East Asia & Pacific",..: 2 2 4 3 2 1 2 2 2 7 ...
##  $ GDP_2006          : num  3052 42417 42950 6649 2127 ...
##  $ CS_rate           : num  0.256 0.237 0.1 0.352 0.141 0.303 0.271 0.076 0.159 0.036 ...
##  $ Income_Group_order: Factor w/ 5 levels "Low income","Lower middle income",..: 3 4 4 4 2 5 5 3 5 1 ...

dim(Cesarean_data)

## [1] 137   8

names(Cesarean_data)

## [1] "Country_Name"       "CountryCode"        "Births_Per_1000"   
## [4] "Income_Group"       "Region"             "GDP_2006"          
## [7] "CS_rate"            "Income_Group_order"

head(Cesarean_data)

##           Country_Name CountryCode Births_Per_1000         Income_Group
## 1              Albania         ALB              46  Upper middle income
## 2              Andorra         AND               1 High income: nonOECD
## 3 United Arab Emirates         ARE              63 High income: nonOECD
## 4            Argentina         ARG             689 High income: nonOECD
## 5              Armenia         ARM              47  Lower middle income
## 6            Australia         AUS             267    High income: OECD
##                       Region  GDP_2006 CS_rate   Income_Group_order
## 1      Europe & Central Asia  3051.768   0.256  Upper middle income
## 2      Europe & Central Asia 42417.229   0.237 High income: nonOECD
## 3 Middle East & North Africa 42950.101   0.100 High income: nonOECD
## 4  Latin America & Caribbean  6649.414   0.352 High income: nonOECD
## 5      Europe & Central Asia  2126.619   0.141  Lower middle income
## 6        East Asia & Pacific 36100.559   0.303    High income: OECD

Here I am simply showing some of the details and structure of the data set.

Right now, CS_rate is presented as a number between 0 and 1. For plotting, it will be nice to have the proportion displayed as a number between 0 and 100. I used dplyr’s mutate() function to add a new variable called CS_rate_100 to the dataset that takes this format.

Cesarean_data <- Cesarean_data %>% mutate(CS_rate_100 = CS_rate*100)

Cesarean_data %>% pull(Income_Group_order) %>% levels

## [1] "Low income"           "Lower middle income"  "Upper middle income" 
## [4] "High income: nonOECD" "High income: OECD"

ggplot(Cesarean_data, aes(x = Income_Group_order)) + geom_bar(col = "white", fill = "royalblue") +
  labs(y = "Count of countries", x = "Income Level",
  title = "Distribution of countries by income level") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

ggplot(Cesarean_data, aes(x = Region)) + geom_bar(col = "white", fill = "royalblue") +
  labs(y = "Count of countries", x = "Region",
  title = "Distribution of countries by region") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

ggplot(Cesarean_data, aes(x = GDP_2006)) +
  geom_histogram(binwidth = 2000, col = "white", fill = "royalblue") +
  labs(y = "Count of countries", x = "Gross Domestic Product ($)",
  title = "Distribution of GDP across countries") +
  theme_minimal()

ggplot(Cesarean_data, aes(x = CS_rate_100)) +
  geom_histogram(binwidth = 4, col = "white", fill = "royalblue") +
  labs(y = "Count of countries", x = "Cesarean delivery rate (%)",
  title = "Distribution of cesarean delivery rates across countries") +
  theme_minimal()

These are descriptive graphs showing some of the distributions of the various variables in the data set.

When looking at the “Distribution of cesarean delivery rates across countries” graph, we can see there appears to be two peaks (bimodal). I will break up the data by income group to tease out why this double peak may exist.

ggplot(data = Cesarean_data, aes(x = CS_rate_100)) +
geom_histogram(binwidth = 4, col = "white", fill = "royalblue") +
xlab("Cesarean delivery rate (%)") +
theme_minimal() +
facet_wrap(vars(Income_Group_order))

Based on this plot and the previous one, the data has two peaks because many low income countries have a very low cesarean delivery rate while at the same time many upper middle and and high income countries have a much higher cesarean delivery rate. This points to a great inequality in c-section rates based on income/economic inequality. Because they have lower access to healthcare in general and cesarean delivery requires trained surgeons and adequate hospitals to preform, something many people in lower income nations cannot afford or simply do not have access to in their countries.

ggplot(data = Cesarean_data, aes(x = GDP_2006, y = CS_rate_100)) +
geom_point() + labs(y = "Cesarean delivery rate (%)", x = "Gross Domestic Product ($)")

We can see that many of the points are condensed towards the lower left corner. The distributions of both cesarean delivery rate and GDP cover a wide range of values. Both of these variables are good candidates for log transformations to spread out the range of data at the lowest levels.

In the next code chunk, I used the mutate() function to add two new logged variables to the data set Cesarean_data I called the variables log_CS and log_GDP. To do this I used base e, also know as natural logarithms, to create the logged variables.

Cesarean_data <- Cesarean_data %>% mutate(log_CS = log(CS_rate_100),
log_GDP = log(GDP_2006))

ggplot(Cesarean_data, aes(x = log_GDP, y = log_CS)) +
geom_point() + labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)") + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

After logging both variables I plotted them into a scatter plot and fit a curve to the data so we can visualize their relationship. We can see that the relationship between logged GDP and logged CS looks linear for the first 75% of the data. After that the slope flattens out. So overall the relationship is not linear.

ggplot(data = Cesarean_data, aes(x = log_GDP, y = log_CS)) +
  geom_point(aes(col = Income_Group)) + 
  labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)")  + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Here I’ve taken that same scatter plot and linkied the color of the points to the variable Income_Group. Based on this colored scatter plot, it looks like the relationship is linear for those countries that are not categorized as one of the two high income categories.

I would like to use linear regression on the data. To do this, I used a dplyr function to make a new data set called Cesarean_data_sub that only contains the low-, lower-middle, and upper-middle income countries, since the high income categories are not linear with logged GDP. I will then remake the last scatter plot, this time using Cesarean_data_sub to see if the relationship looks approximately linear between the logged variables.

Cesarean_data_sub <- Cesarean_data %>% filter(Income_Group %in% c("Low income",
"Lower middle income",
"Upper middle income"))

ggplot(data = Cesarean_data_sub, aes(x = log_GDP, y = log_CS)) +
  geom_point(aes(col = Income_Group)) + 
  labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)")  + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Given that the relationship is approximately linear, I will use linear regression to model the relationship between log_CS as the response variable and log_GDP as the explanatory variable.

lm <- lm(formula = log_CS ~ log_GDP, data = Cesarean_data_sub)
tidy(lm)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.94     0.515      -7.66 2.18e-11
## 2 log_GDP        0.819    0.0706     11.6  1.72e-19

glance(lm)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.602         0.597 0.691      135. 1.72e-19     2  -94.5  195.  203.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

Based on my linear model, a one unit increase in the natural log of GDP is associated with a 0.82 unit increase in the natural log of the cesarean delivery rate. Additionally, based on our R squared we can say that 60.19% of the variation in C-section rate is explained by the regression model.

lm_augment <- augment(lm)

# Scatterplot
ggplot(data = lm_augment, aes(x = log_GDP, y = log_CS)) +
  geom_point(alpha = 0.5)  + 
  geom_segment(aes(xend = log_GDP, yend = .fitted), lty = 2, alpha = 0.5)  + 
  labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)", title = "(a) Scatterplot") +
  geom_smooth(method = "lm", se = F)

# QQ Plot
ggplot(lm_augment, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line() +
  theme_minimal(base_size = 15) +
  labs(y = "Residuals", x = "Theoretical quantiles", title = "(b) QQplot")

# Fitted vs Residuals
ggplot(lm_augment, aes(y = .resid, x = .fitted)) +
  geom_point() +
  theme_minimal(base_size = 15) +
  geom_hline(aes(yintercept = 0)) +
  labs(y = "Residuals", x = "Fitted values", title = "(c) Fitted vs. residuals")

# Amount explained
gather <- lm_augment %>% select(log_CS, .resid) %>%
  gather(key = "type", value = "value", log_CS, .resid)

ggplot(gather, aes(y = value)) +
  geom_boxplot(aes(fill = type)) +
  theme_minimal(base_size = 15) +
  labs(title = "(d) Amount explained")

I have created plots so we can check the assumptions of the regression model and see if any of them are violated so we can decide if we can conduct an inference test.

Assumptions:

x and y have linear relation in population
y varies Normally about the line of best fit
Obs are independent
Standard deviation of the responses is the same for all values of x -> Can be examined by estimating residuals AND watch out for outliers.

Plot (a) shows a fitted regression line and the data. The estimated residuals are shown by the dashed lines. We want to see that the residuals are sometimes positive and sometimes negative with no trend in their location. We can see that our plot has no trend in the location of the residuals overall. Thus we have a linear relationship.

Plot (b) shows a QQ plot of the residuals (to check if they are Normally distributed). There seems to be some divergence from normality at the ends of the plot, however the middle values are tightly on the line, indicating it is roughly normal.

Plot (c) shows a plot of the fitted values vs. the residuals. We want this to look like a random scatter. If their is a pattern then an assumption has been violated. Our plot seems to have no detectable pattern and it seems to be randomly scattered. Thus our assumption is met.

Plot (d) shows a boxplot of the distribution of y vs. the distribution of the residuals. If x does a good job describing y, then the box plot for the residuals will be much shorter because the model fit is good. The model seems to fit decently well.

Thus we will perform a hypothesis test.

$H_0$ : $b$ = 0 (i.e., There is no association between C-section rate and GDP)

$H_a$ : $b$ $\neq$ 0 (i.e., There is an association between C-section rate and GDP)

tidy(lm)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.94     0.515      -7.66 2.18e-11
## 2 log_GDP        0.819    0.0706     11.6  1.72e-19

Interpretation: We found a p-value of p = 1.715801e-19. This is less than our cut-off value of $\alpha$ = 5%. So we reject the null hypothesis that there is no association between C-section rate and GDP and have evidence to support the alternative hypothesis that there is an association between C-section rate and GDP.

Conclusion:

GDP does explain some of the variation we see in c-section rates across countries, and GDP and c-section rate have a positive association that is partially explained by the linear model. Another model with more variables might better fit the data. Of course, without any casual model, we are only able to find associations, however these associations may hold promise for further analysis and studies.

4. References

Boatin AA, Schlotheuber A, Betran AP, Moller AB, Barros AJ, Boerma T, Torloni MR, Victora CG, Hosseinpoor AR. Within country inequalities in caesarean section rates: observational study of 72 low and middle income countries. BMJ. 2018;24(360):k55. DOI: 10.1136/bmj.k55.
Gibbons L, Belizan JM, Lauer JA, Betran AP, Merialdi M, Althabe F. The global numbers and costs of additionally needed and unnecessary caesarean sections performed per year: overuse as a barrier to universal coverage. World Health Organization; 2010. Available at: http://www.who.int/healthsystems/ topics/financing/healthreport/30C-sectioncosts.pdf. Accessed November 24, 2018.
The World Bank. GDP per capita (current US$). Washington, DC; Available at: http://data.worldbank. org/indicator/NY.GDP.PCAP.CD. Accessed November 24, 2018.

PH 251D Project 2

Maher Yassin

November 26, 2018