1. Introduction
For my project I will be using R to analyze ecologic study data on how the proportion of pregant women that undergo a cesarean delivery (c-section) varies across 137 counties with varying GDP levels. A recent study in the British Medical Journal found that cesarean delivery rates ranged between 0.6% in South Sudan and 58.9% in the Dominican Republic.(1) In the US, the rate is about 33%. Economic inequity may be a factor that explains lower c-section rates, as lower income populations may not have the same access to medical services that are able to provide c-sections as higher income populations. The GDPs of these countries may help explain some of the great variation observed in c-section rates across the globe.
2. Analysis (R Code) & Explanations
library(dplyr)
library(ggplot2)
library(readr)
library(broom)
library(tidyr)
Dictionary of Variables in the Data Set
| Variable | Description |
|---|---|
| Country_Name | The name of the country |
| CountryCode | 3-digit unique country identifier |
| Births_Per_1000 | The number of births per 1000 people in the country |
| Income_Group | The income group the country is categorized into |
| Region | The geographic region the country belongs to |
| GDP_2006 | The gross domestic product (GDP) of the country in 2006 |
| CS_rate | The percent of births that were delivered by cesarean section |
| Income_Group_order | The “ordered” version of Income_Group |
Cesarean_data <- read.csv("~/R_Workspace/Cesarean_data.csv")
Cesarean_data <- Cesarean_data %>%
mutate(Income_Group_order = forcats::fct_relevel(Income_Group, "Low income",
"Lower middle income", "Upper middle income",
"High income: nonOECD", "High income: OECD"))
First I re-orders the variable Income_Group in the specified order so that the data will be plotted in a logical way.
str(Cesarean_data)
## 'data.frame': 137 obs. of 8 variables:
## $ Country_Name : Factor w/ 137 levels "Albania","Algeria",..: 1 3 128 4 5 6 7 8 12 13 ...
## $ CountryCode : Factor w/ 137 levels "ALB","AND","ARE",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Births_Per_1000 : int 46 1 63 689 47 267 76 166 119 342 ...
## $ Income_Group : Factor w/ 5 levels "High income: nonOECD",..: 5 1 1 1 4 2 2 5 2 3 ...
## $ Region : Factor w/ 7 levels "East Asia & Pacific",..: 2 2 4 3 2 1 2 2 2 7 ...
## $ GDP_2006 : num 3052 42417 42950 6649 2127 ...
## $ CS_rate : num 0.256 0.237 0.1 0.352 0.141 0.303 0.271 0.076 0.159 0.036 ...
## $ Income_Group_order: Factor w/ 5 levels "Low income","Lower middle income",..: 3 4 4 4 2 5 5 3 5 1 ...
dim(Cesarean_data)
## [1] 137 8
names(Cesarean_data)
## [1] "Country_Name" "CountryCode" "Births_Per_1000"
## [4] "Income_Group" "Region" "GDP_2006"
## [7] "CS_rate" "Income_Group_order"
head(Cesarean_data)
## Country_Name CountryCode Births_Per_1000 Income_Group
## 1 Albania ALB 46 Upper middle income
## 2 Andorra AND 1 High income: nonOECD
## 3 United Arab Emirates ARE 63 High income: nonOECD
## 4 Argentina ARG 689 High income: nonOECD
## 5 Armenia ARM 47 Lower middle income
## 6 Australia AUS 267 High income: OECD
## Region GDP_2006 CS_rate Income_Group_order
## 1 Europe & Central Asia 3051.768 0.256 Upper middle income
## 2 Europe & Central Asia 42417.229 0.237 High income: nonOECD
## 3 Middle East & North Africa 42950.101 0.100 High income: nonOECD
## 4 Latin America & Caribbean 6649.414 0.352 High income: nonOECD
## 5 Europe & Central Asia 2126.619 0.141 Lower middle income
## 6 East Asia & Pacific 36100.559 0.303 High income: OECD
Here I am simply showing some of the details and structure of the data set.
Right now, CS_rate is presented as a number between 0 and 1. For plotting, it will be nice to have the proportion displayed as a number between 0 and 100. I used dplyr’s mutate() function to add a new variable called CS_rate_100 to the dataset that takes this format.
Cesarean_data <- Cesarean_data %>% mutate(CS_rate_100 = CS_rate*100)
Cesarean_data %>% pull(Income_Group_order) %>% levels
## [1] "Low income" "Lower middle income" "Upper middle income"
## [4] "High income: nonOECD" "High income: OECD"
ggplot(Cesarean_data, aes(x = Income_Group_order)) + geom_bar(col = "white", fill = "royalblue") +
labs(y = "Count of countries", x = "Income Level",
title = "Distribution of countries by income level") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
ggplot(Cesarean_data, aes(x = Region)) + geom_bar(col = "white", fill = "royalblue") +
labs(y = "Count of countries", x = "Region",
title = "Distribution of countries by region") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
ggplot(Cesarean_data, aes(x = GDP_2006)) +
geom_histogram(binwidth = 2000, col = "white", fill = "royalblue") +
labs(y = "Count of countries", x = "Gross Domestic Product ($)",
title = "Distribution of GDP across countries") +
theme_minimal()
ggplot(Cesarean_data, aes(x = CS_rate_100)) +
geom_histogram(binwidth = 4, col = "white", fill = "royalblue") +
labs(y = "Count of countries", x = "Cesarean delivery rate (%)",
title = "Distribution of cesarean delivery rates across countries") +
theme_minimal()
These are descriptive graphs showing some of the distributions of the various variables in the data set.
When looking at the “Distribution of cesarean delivery rates across countries” graph, we can see there appears to be two peaks (bimodal). I will break up the data by income group to tease out why this double peak may exist.
ggplot(data = Cesarean_data, aes(x = CS_rate_100)) +
geom_histogram(binwidth = 4, col = "white", fill = "royalblue") +
xlab("Cesarean delivery rate (%)") +
theme_minimal() +
facet_wrap(vars(Income_Group_order))
Based on this plot and the previous one, the data has two peaks because many low income countries have a very low cesarean delivery rate while at the same time many upper middle and and high income countries have a much higher cesarean delivery rate. This points to a great inequality in c-section rates based on income/economic inequality. Because they have lower access to healthcare in general and cesarean delivery requires trained surgeons and adequate hospitals to preform, something many people in lower income nations cannot afford or simply do not have access to in their countries.
ggplot(data = Cesarean_data, aes(x = GDP_2006, y = CS_rate_100)) +
geom_point() + labs(y = "Cesarean delivery rate (%)", x = "Gross Domestic Product ($)")
We can see that many of the points are condensed towards the lower left corner. The distributions of both cesarean delivery rate and GDP cover a wide range of values. Both of these variables are good candidates for log transformations to spread out the range of data at the lowest levels.
In the next code chunk, I used the mutate() function to add two new logged variables to the data set Cesarean_data I called the variables log_CS and log_GDP. To do this I used base e, also know as natural logarithms, to create the logged variables.
Cesarean_data <- Cesarean_data %>% mutate(log_CS = log(CS_rate_100),
log_GDP = log(GDP_2006))
ggplot(Cesarean_data, aes(x = log_GDP, y = log_CS)) +
geom_point() + labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
After logging both variables I plotted them into a scatter plot and fit a curve to the data so we can visualize their relationship. We can see that the relationship between logged GDP and logged CS looks linear for the first 75% of the data. After that the slope flattens out. So overall the relationship is not linear.
ggplot(data = Cesarean_data, aes(x = log_GDP, y = log_CS)) +
geom_point(aes(col = Income_Group)) +
labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here I’ve taken that same scatter plot and linkied the color of the points to the variable Income_Group. Based on this colored scatter plot, it looks like the relationship is linear for those countries that are not categorized as one of the two high income categories.
I would like to use linear regression on the data. To do this, I used a dplyr function to make a new data set called Cesarean_data_sub that only contains the low-, lower-middle, and upper-middle income countries, since the high income categories are not linear with logged GDP. I will then remake the last scatter plot, this time using Cesarean_data_sub to see if the relationship looks approximately linear between the logged variables.
Cesarean_data_sub <- Cesarean_data %>% filter(Income_Group %in% c("Low income",
"Lower middle income",
"Upper middle income"))
ggplot(data = Cesarean_data_sub, aes(x = log_GDP, y = log_CS)) +
geom_point(aes(col = Income_Group)) +
labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)") +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Given that the relationship is approximately linear, I will use linear regression to model the relationship between log_CS as the response variable and log_GDP as the explanatory variable.
lm <- lm(formula = log_CS ~ log_GDP, data = Cesarean_data_sub)
tidy(lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.94 0.515 -7.66 2.18e-11
## 2 log_GDP 0.819 0.0706 11.6 1.72e-19
glance(lm)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.602 0.597 0.691 135. 1.72e-19 2 -94.5 195. 203.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Based on my linear model, a one unit increase in the natural log of GDP is associated with a 0.82 unit increase in the natural log of the cesarean delivery rate. Additionally, based on our R squared we can say that 60.19% of the variation in C-section rate is explained by the regression model.
lm_augment <- augment(lm)
# Scatterplot
ggplot(data = lm_augment, aes(x = log_GDP, y = log_CS)) +
geom_point(alpha = 0.5) +
geom_segment(aes(xend = log_GDP, yend = .fitted), lty = 2, alpha = 0.5) +
labs(y = "Logged Cesarean delivery rate (%)", x = "Logged Gross Domestic Product ($)", title = "(a) Scatterplot") +
geom_smooth(method = "lm", se = F)
# QQ Plot
ggplot(lm_augment, aes(sample = .resid)) +
geom_qq() +
geom_qq_line() +
theme_minimal(base_size = 15) +
labs(y = "Residuals", x = "Theoretical quantiles", title = "(b) QQplot")
# Fitted vs Residuals
ggplot(lm_augment, aes(y = .resid, x = .fitted)) +
geom_point() +
theme_minimal(base_size = 15) +
geom_hline(aes(yintercept = 0)) +
labs(y = "Residuals", x = "Fitted values", title = "(c) Fitted vs. residuals")
# Amount explained
gather <- lm_augment %>% select(log_CS, .resid) %>%
gather(key = "type", value = "value", log_CS, .resid)
ggplot(gather, aes(y = value)) +
geom_boxplot(aes(fill = type)) +
theme_minimal(base_size = 15) +
labs(title = "(d) Amount explained")
I have created plots so we can check the assumptions of the regression model and see if any of them are violated so we can decide if we can conduct an inference test.
Assumptions:
x and y have linear relation in population
y varies Normally about the line of best fit
Obs are independent
Standard deviation of the responses is the same for all values of x -> Can be examined by estimating residuals AND watch out for outliers.
Plot (a) shows a fitted regression line and the data. The estimated residuals are shown by the dashed lines. We want to see that the residuals are sometimes positive and sometimes negative with no trend in their location. We can see that our plot has no trend in the location of the residuals overall. Thus we have a linear relationship.
Plot (b) shows a QQ plot of the residuals (to check if they are Normally distributed). There seems to be some divergence from normality at the ends of the plot, however the middle values are tightly on the line, indicating it is roughly normal.
Plot (c) shows a plot of the fitted values vs. the residuals. We want this to look like a random scatter. If their is a pattern then an assumption has been violated. Our plot seems to have no detectable pattern and it seems to be randomly scattered. Thus our assumption is met.
Plot (d) shows a boxplot of the distribution of y vs. the distribution of the residuals. If x does a good job describing y, then the box plot for the residuals will be much shorter because the model fit is good. The model seems to fit decently well.
Thus we will perform a hypothesis test.
\(H_0\) : \(b\) = 0 (i.e., There is no association between C-section rate and GDP)
\(H_a\) : \(b\) \(\neq\) 0 (i.e., There is an association between C-section rate and GDP)
tidy(lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.94 0.515 -7.66 2.18e-11
## 2 log_GDP 0.819 0.0706 11.6 1.72e-19
Interpretation: We found a p-value of p = 1.715801e-19. This is less than our cut-off value of \(\alpha\) = 5%. So we reject the null hypothesis that there is no association between C-section rate and GDP and have evidence to support the alternative hypothesis that there is an association between C-section rate and GDP.
Conclusion:
GDP does explain some of the variation we see in c-section rates across countries, and GDP and c-section rate have a positive association that is partially explained by the linear model. Another model with more variables might better fit the data. Of course, without any casual model, we are only able to find associations, however these associations may hold promise for further analysis and studies.
4. References
Boatin AA, Schlotheuber A, Betran AP, Moller AB, Barros AJ, Boerma T, Torloni MR, Victora CG, Hosseinpoor AR. Within country inequalities in caesarean section rates: observational study of 72 low and middle income countries. BMJ. 2018;24(360):k55. DOI: 10.1136/bmj.k55.
Gibbons L, Belizan JM, Lauer JA, Betran AP, Merialdi M, Althabe F. The global numbers and costs of additionally needed and unnecessary caesarean sections performed per year: overuse as a barrier to universal coverage. World Health Organization; 2010. Available at: http://www.who.int/healthsystems/ topics/financing/healthreport/30C-sectioncosts.pdf. Accessed November 24, 2018.
The World Bank. GDP per capita (current US$). Washington, DC; Available at: http://data.worldbank. org/indicator/NY.GDP.PCAP.CD. Accessed November 24, 2018.