The dataset used in this exercise is derived from the World Bank, one of the most widely used sources for internationally comparable development indicators. It combines demographic, economic, and social variables at the country level, allowing for cross-national analyses of global development patterns. Data refer to the year 2018.
Each observation represents a country and includes the following variables:
country: Country nameincome: Income group classification (e.g. low, middle,
high income)femwork: Share of women employed in the
non-agricultural sector (% of total non-agricultural employment)gni: Gross national income per capita (current US$),
used as a proxy for economic wealthlifeexp: Life expectancy at birth (in years), a key
indicator of population healthchildmort: Under-5 mortality rate (per 1,000 live
births), reflecting early-life health conditionspopgrowth: Annual population growth rate (in %)Together, these variables capture central dimensions of development, including economic capacity, health outcomes, demographic dynamics, and gender-related labor market participation. The dataset is particularly suitable for illustrating statistical relationships in real-world, non-ideal data, where issues such as non-linearity, heteroscedasticity, and unequal distributions are common.
Start with a theoretical assumption about an association (can be proven by statistics):
The higher the share of women in non agricultural workforce, the slower populations will grow.
This assumpution is based on the following line of argumentation: Economic development > improved family economics > female empowerment > increasing age at first birth > decreasing fertility rate > decreasing population growth.
Note that the underlying causal chain cannot be established using statistical analysis alone. However, we can provide empirical evidence to support our arguments.
Over the past few decades, there has been a notable increase in the participation of women in the non-agricultural workforce worldwide. This shift is primarily driven by factors such as changing social norms, improved education and employment opportunities for women, and the desire for economic empowerment.
There is often a positive correlation between the proportion of women in the non-agricultural workforce and a country’s level of economic development. As economies progress, they tend to undergo a structural transformation, shifting from primarily agricultural-based to more diversified sectors such as manufacturing, services, and technology. This transition provides increased opportunities for women to engage in non-agricultural employment.
As more women participate in the non-agricultural workforce, there is evidence to suggest a decline in fertility rates. Women who are engaged in paid employment often delay starting a family or choose to have fewer children due to the demands of their careers, the pursuit of education, or the desire for greater financial stability. Consequently, countries with higher female workforce participation rates generally experience lower population growth rates. The increased inclusion of women in the non-agricultural workforce is often linked to improved access to education and greater gender equality. When women have access to quality education and are empowered to participate in the labor market, they tend to make informed choices about family planning, leading to lower population growth.
It is essential to recognize that cultural norms, social structures, government policies, and economic factors can influence these dynamics differently. In some regions, where gender disparities are more pronounced or where traditional gender roles prevail, the relationship between female workforce participation and population growth may be less pronounced or exhibit different patterns.
We cannot prove these causalities by showing that the statistical association exists. At best we can strengthen our arguments. Vice versa, we can refute (at least weaken) our theoretical assumptions by showing that the statistical associations don’t exist.
Worldwide we observe a moderately positive fairly symmetrically distributed population growth of countries
summary(df$popgrowth)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## -3.2300 0.4275 1.2050 1.2659 2.1225 5.8300 1
hist(df$popgrowth)
The share of women in non agricultural sector is somewhat left skewed: many countries with high, some countries with lower shares, and many countries are close or even above gender parity (50%)
summary(df$femwork)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 6.20 31.40 40.70 37.96 47.50 55.80 38
hist(df$femwork)
There is a negative association: the higher the share of women in non agricultural workforce, the lower is the population growth, In the following plot, each dot represents a single country.
plot(df$femwork, df$popgrowth)
Identify outliers: countries with extreme labour inmigration (upper left) or extreme labor force outflow (lower left).
df$outlier = (df$popgrowth > 5) | (df$popgrowth < 0 & df$femwork < 20)
t = as.data.frame(table(df[df$outlier, "country"]))
names(t) = c("Country", "Frequency")
kable_styling(kable(t), full_width = FALSE)
| Country | Frequency |
|---|---|
| Oman | 1 |
| Syrian Arab Republic | 1 |
plot(df$femwork, df$popgrowth, col=ifelse(df$outlier, 'orange', 'darkblue'),
pch = ifelse(df$outlier, 16, 1),
main = "How population growth is associated with female empowerment",
xlab = "Share of female workforce in non agricultural sector (%)",
ylab = "Population growth (%)")
Pearson’s correlation coefficient \(r\) measures the strength of a linear association, ranging from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no association. Because missing values are present in the dataset, calculations must be restricted to complete observations.
# does not work due to NAs
cor(df$popgrowth, df$femwork)
## [1] NA
# tell R to restrict calculation to complete observations only
cor(df$popgrowth, df$femwork, use = "complete.obs")
## [1] -0.5054793
The observed \(r\) = -0.505 indicates a strong negative correlation, meaning that approximately 26% of the shared variation between population growth and female workforce is accounted for.
While Pearson’s \(r\) quantifies the strength of a relationship using the original metric values, Spearman’s \(r_s\) is computed from the ranks of the variables. Plotting the ranks shows that they are uniformly distributed, as each rank occurs exactly once. The scatterplot is ‘stretched’ and ‘smoothed’ in both directions, and clusters of values are resolved.
par(mfrow = c(2, 3))
plot(df$femwork, df$popgrowth,
main="Data used by Pearson's r",
col=ifelse(df$outlier, 'orange', 'darkblue'),
pch = ifelse(df$outlier, 16, 1))
hist(df$femwork)
hist(df$popgrowth)
df2 = df[!is.na(rowSums(df[, c("popgrowth", "femwork")])), ]
plot(rank(df2$femwork), rank(df2$popgrowth),
main="Data used by Spearman's r",
col=ifelse(df2$outlier, 'orange', 'darkblue'),
pch = ifelse(df2$outlier, 16, 1))
hist(rank(df$femwork))
hist(rank(df$popgrowth))
Comparing both concepts yields similar results for this specific relationship:
t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
c(
cor(df$popgrowth, df$femwork, use = "complete.obs", method = "pearson"),
cor(df$popgrowth, df$femwork, use = "complete.obs", method = "spearman")
), 3)
)
kable_styling(kable(t), full_width = FALSE)
| Coefficient | Value |
|---|---|
| Pearson | -0.505 |
| Spearman | -0.529 |
To describe the shape of the association, a linear regression line is estimated to capture the average relationship. In this analysis, female workforce is treated as the explanatory variable driving population growth (the dependent variable). Estimating the linear model yields a regression constant (intercept) and a regression coefficient.
lm(df$popgrowth ~ df$femwork)
##
## Call:
## lm(formula = df$popgrowth ~ df$femwork)
##
## Coefficients:
## (Intercept) df$femwork
## 3.00481 -0.04764
The estimated regression line is:
est. popgrowth = est. popgrowth = 3.005 - 0.048*femwork
.
The intercept suggests that with 0% female non-agricultural workforce, the estimated population growth is 3.0%. The regression coefficient implies that for every 10 percentage point increase in female workforce, population growth decreases by about 0.5 percentage points.
This relationship does not change significantly if outliers are excluded:
par(mfrow = c(1, 2))
ymin = min(df[df$outlier, "popgrowth"], na.rm=T)
ymax = max(df[df$outlier, "popgrowth"], na.rm=T)
title = paste("est. popgrowth =", "\n", mtxt1)
plot(df$femwork, df$popgrowth, ylim=c(ymin, ymax),
main=title,
col=ifelse(df$outlier, 'orange', 'darkblue'),
pch = ifelse(df$outlier, 16, 1))
abline(lm(df$popgrowth ~ df$femwork))
df2 = df[!df$outlier,]
title = paste("est. popgrowth =", "\n", mtxt2)
plot(df2$femwork, df2$popgrowth,
main=title,
ylim=c(ymin, ymax), col='darkblue')
abline(lm(df2$popgrowth ~ df2$femwork))
Regression models are used in two main contexts:
On average, population growth is estimated
est. popgrowth = 3.04 - 0.049*femwork = 3.04 - 0.049**60 = 0.1%
for countries with a 60% share of women in non agricultural workforce.
xmin = min(df[df$outlier, "femwork"], na.rm=T)
ymin = min(df[df$outlier, "popgrowth"], na.rm=T)
ymax = max(df[df$outlier, "popgrowth"], na.rm=T)
title = "Population growth estimate for countries with a 60% share of women in non agricultural workforce."
df2 = df[!df$outlier,]
m=lm(df2$popgrowth ~ df2$femwork)
title = paste("est. popgrowth =", "\n", mtxt2)
plot(df2$femwork, df2$popgrowth,
main=title,
ylim=c(ymin, ymax), xlim=c(xmin, 62),
col='darkblue')
abline(m)
b = coefficients(m)
yest = b[1]+b[2]*60
# lx = c(60, 60, xmin)
# ly = c(ymin, yest, yest)
#lines(lx,ly)
abline(v=60, lty=3, col='orange')
abline(h=yest, lty=3, col='orange')
points(60, yest, bg='orange', col='white', pch=21, cex=1)
In both cases, model fit (e.g., explained variance, prediction error) and the validity of underlying assumptions are crucial.
Exploring the relationship between female participation in the non-agricultural workforce and child mortality is central to understanding broader development dynamics. Both indicators capture key dimensions of societal progress: child mortality reflects health conditions and access to basic services, while female labor force participation indicates economic inclusion and gender equality. Their association is theoretically meaningful, as improvements in child health can reduce caregiving burdens and enable greater economic participation of women, while increased female employment may, in turn, contribute to better health outcomes through higher household income and improved resource allocation. Analyzing this relationship therefore provides insight into how health and economic structures interact in the process of global development.
Accordingly, the hypothesis analysed in this section claims:
Countries with higher child mortality tend to exhibit lower female participation in the non-agricultural workforce.
We start again by describing our variables and visually exploring their relationships.
Child mortality is extremely right-skewed (L-shaped): most countries have low rates, while some countries have extremely high rates.
summary(df$childmort)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 2.10 7.70 16.60 30.20 46.75 132.50 30
hist(df$childmort)
As seen above, the share of women in the non-agricultural sector is somewhat left-skewed: many countries have high shares, while some countries have lower shares. Many countries are close to or even above gender parity (50%).
summary(df$femwork)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 6.20 31.40 40.70 37.96 47.50 55.80 38
hist(df$femwork)
A scatterplot reveals a negative association.
plot(df$childmort, df$femwork)
Regression analysis estimates female workforce participation as dependent of child mortality:
est. female workforce = 42.562 - 0.196*childmort
The reduction of child mortality by one child out of 1,000 implies an additional share of female non-agricultural workforce by 0.2 percentage points on average.
But, given the above scatterplot, how reliable is this basic interpretation? The scatterplot shows that the variation in child mortality is much larger among countries with low female workforce participation than among those with high participation. This pattern indicates heteroscedasticity, i.e. unequal variance of the outcome across the range of the predictor. As a result, simple linear models have limited reliability for inference and are particularly weak for prediction, since the uncertainty is not constant across observations. Substantively, this pattern also points to the presence of omitted variables. In this case, regional or cultural context may play a role: for example, countries in North Africa and the Middle East tend to have low female workforce participation but relatively low child mortality, whereas many countries in Sub-Saharan Africa combine low participation with high child mortality. This suggests that the observed association is at least partly driven by underlying structural differences between groups of countries.
cols = c("darkorange", "darkred", "lightsteelblue")
plot(df$childmort, df$femwork,
pch = 19,
col = cols[df$region2])
abline(lm(df$femwork ~ df$childmort))
legend("bottomright",
legend = levels(df$region2),
pch = 19,
col = cols)
Looking at the differences between Pearson’s \(r\) and Spearmans’s \(r_s\) reveals, that \(r_s\) is much greater than \(r\). Obviously, Pearson’s \(r\) does not cope well with with ‘less then perfect’ associations or unequal variances.
par(mfrow = c(2, 3))
plot(df$childmort, df$femwork,
main="Data used by Pearson's r",
col='darkblue',
pch = 1)
hist(df$childmort)
hist(df$femwork)
df2 = df[!is.na(rowSums(df[, c("childmort", "femwork")])), ]
plot(rank(df2$childmort), rank(df2$femwork),
main="Data used by Spearman's r",
col='darkblue',
pch = 1)
hist(rank(df2$childmort))
hist(rank(df2$femwork))
t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
c(
cor(df$childmort, df$femwork, use = "complete.obs", method = "pearson"),
cor(df$childmort, df$femwork, use = "complete.obs", method = "spearman")
), 3)
)
kable_styling(kable(t), full_width = FALSE)
| Coefficient | Value |
|---|---|
| Pearson | -0.442 |
| Spearman | -0.585 |
A central question in global development is how economic resources
translate into population health. Gross national income per capita
(gni) is commonly used as a proxy for a country’s economic
capacity, while life expectancy (lifeexp) captures overall
health outcomes. Examining the relationship between these variables
helps to understand whether—and to what extent—greater economic power is
associated with longer lives.
Which impact does the economic power of countries (gni)
have on the life expectancy of their populations (lifeexp)?
This is one of the most prominent questions in global development
research in recent decades. Our hypothesis for this section is
A country’s economic power is positively associated with the life expectancy of its population.
summary(df$lifeexp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 48.93 65.91 73.51 71.43 77.47 83.98 18
hist(df$lifeexp)
summary(df$gni)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 271.2 1558.0 5562.2 16456.1 20305.4 172676.3 22
hist(df$gni)
plot(df$gni, df$lifeexp)
The distribution of lifeexp is moderately left skewed.
While a majority of populations show life expectancies of 70 years and
above, some countries exist with life expectancies below 60 (and even
below 50) years. In contrast, gni shows a L-shaped
distribution indicating that more than three quarters of the countries
world wide have gni values below 20.000 US$ while the last quarter
covers a range from 20.000 to over 170.000 US$ per capita.
The scatterplot reveals a clear and strong, yet non-linear relationship: life expectancy increases steeply at lower levels of economic development, particularly among low-income countries. However, this increase gradually attenuates as economic prosperity rises. This pattern reflects a classic plateau effect, where marginal gains in life expectancy become progressively smaller at higher income levels, indicating diminishing returns to additional economic resources in already affluent contexts. ions.
t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
c(
cor(df$gni, df$lifeexp, use = "complete.obs", method = "pearson"),
cor(df$gni, df$lifeexp, use = "complete.obs", method = "spearman")
), 3)
)
kable_styling(kable(t), full_width = FALSE)
| Coefficient | Value |
|---|---|
| Pearson | 0.583 |
| Spearman | 0.847 |
Interpretation: While Pearson’s \(r\) implies that 0.5832 \(\approx\) 34% of the variance is shared, Spearman’s \(r_s\) implies that 0.8472 \(\approx\) 72% of the variance in the ranks is shared.
The rank coefficient \(r_s\) is substantially larger than Pearson’s \(r\). Pearson’s \(r\) captures only the linear component of the association and treats any non-linear structure as unexplained variance; accordingly, it does not adequately represent non-linear relationships. In contrast, Spearman’s \(r_s\) is sensitive to monotonicity, capturing consistent increasing or decreasing trends even when the relationship is not linear. As shown above, \(r_s\) ‘stretches’ the association by operating on ranks. In other words, using ranked variables effectively linearizes a strong monotonic, but non-linear, relationship.
par(mfrow = c(2, 3))
plot(df$gni, df$lifeexp,
main="Data used by Pearson's r",
col='darkblue',
pch = 1)
hist(df$gni)
hist(df$lifeexp)
df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(rank(df2$gni), rank(df2$lifeexp),
main="Data used by Spearman's r",
col='darkblue',
pch = 1)
hist(rank(df2$gni))
hist(rank(df2$lifeexp))
It is evident that the association cannot be adequately modeled by a straight line:
lm(df$lifeexp ~ df$gni)
##
## Call:
## lm(formula = df$lifeexp ~ df$gni)
##
## Coefficients:
## (Intercept) df$gni
## 6.821e+01 2.152e-04
plot(df$gni, df$lifeexp)
abline(lm(df$lifeexp ~ df$gni))
The model output shows that an increase of gni by 10.000
US$ gains 2.1 additional life years on average. But this description
does not reflect the non-liner shape. Alternatively, we could express
the relationship in terms of ranks:
lm(rank(df$lifeexp) ~ rank(df$gni))
##
## Call:
## lm(formula = rank(df$lifeexp) ~ rank(df$gni))
##
## Coefficients:
## (Intercept) rank(df$gni)
## 31.3550 0.7123
plot(rank(df$gni), rank(df$lifeexp))
abline(lm(rank(df$lifeexp) ~ rank(df$gni)))
According to this model, the rank of lifeexp increases
on average by 0.7 if a country succeeds in moving up one rank in
gni. While this is statistically valid, it does not help us
understand the functional form of the relationship. As with rank
transformations, we therefore seek a way to linearize the association so
that we can meaningfully quantify its slope by estimating an appropriate
regression coefficient.
The linear model does not reflect the non-linear structure of the data. In order to deal with non-linear associations, we have two options:
As we prefer to continue working with linear regression models, we choose the second option. A closer look at the non-linear association reveals the non-linear nature of he association in more detail.
### NON-LINEAR REGRESSION
plot(df$gni, df$lifeexp)
abline(v=1000, lty=3, col='orange')
abline(v=10000, lty=3, col='orange')
abline(v=100000, lty=3, col='orange')
abline(h=64, lty=3, col='orange')
abline(h=74, lty=3, col='orange')
abline(h=84, lty=3, col='orange')
points(1000, 64,bg='orange', col='white', pch=21, cex=1.5)
points(10000, 74, bg='orange', col='white', pch=21, cex=1.5)
points(100000, 84, bg='orange', col='white', pch=21, cex=1.5)
It becomes evident that while lifeexp grows linearly (by
adding 10), gni grows geometrically (by multiplying with
10).
# x,y = coordinates; cex = size of points; pch = shape of points; col = border colour; bg = background colour
# we describe our association by these three points:
t = data.frame(gni = c("1.000", "*10", "10.000", "*10", "100.000"),
lifeexp = c(64, "+10", 74, "+10", 84))
kable_styling(kable(t, align = "rr"), full_width = FALSE)
| gni | lifeexp |
|---|---|
| 1.000 | 64 |
| *10 | +10 |
| 10.000 | 74 |
| *10 | +10 |
| 100.000 | 84 |
# gni | MULTIPLY gni by | lifeexp | ADD to lifeexp
# ---------|-----------------|---------|---------------
# 1000 | | 64 |
# | * 10 | | + 10
# 10000 | | 74 |
# | * 10 | | + 10
# 100000 | | 84 |
# The non-linearity of the association is due to the fact, that lifeexp
# grows arithmetically (+) while gni grows geometrically (*). We can transform
# the change of gni into linear growth by looking at the power of 10:
That said, we can transform the change of gni into
linear growth by looking at the power of 10:
t = data.frame(gni = c("1.000 = 10^3^", "*10", "10.000 = 10^4^", "*10", "100.000 = 10^5^"),
exponent = c("3", "+1", "4", "+1", "5"),
lifeexp = c(64, "+10", 74, "+10", 84))
names(t)[2] = "exponent of gni"
kable_styling(kable(t, align = "rr"), full_width = FALSE)
| gni | exponent of gni | lifeexp |
|---|---|---|
| 1.000 = 103 | 3 | 64 |
| *10 | +1 | +10 |
| 10.000 = 104 | 4 | 74 |
| *10 | +1 | +10 |
| 100.000 = 105 | 5 | 84 |
# gni | exponent | ADD to exponent | lifeexp | ADD to lifeexp
# 1000 = 10^3 | 3 | | 64 |
# | | + 1 | | + 10
# 10000 = 10^4 | 4 | | 74 |
# | | + 1 | | + 10
# 100000 = 10^5 | 5 | | 84 |
In general: if we include the exponents of gni to base
10 into our model instead of gni itself we have an
approximately linear relationship. To achieve this, we calculate the
common logarithms of gni. The x-axis now shows the expoents (3, 4, 5)
instead of untransformed values (1.000, 10.000, 100.000), and the
association is approximately linear.
# association with common logarithm (base 10) is close to linear now
plot(log10(df$gni), df$lifeexp)
abline(v=3, lty=3, col='orange')
abline(v=4, lty=3, col='orange')
abline(v=5, lty=3, col='orange')
abline(h=64, lty=3, col='orange')
abline(h=74, lty=3, col='orange')
abline(h=84, lty=3, col='orange')
points(3, 64,bg='orange', col='white', pch=21, cex=1.5)
points(4, 74, bg='orange', col='white', pch=21, cex=1.5)
points(5, 84, bg='orange', col='white', pch=21, cex=1.5)
par(mfrow = c(3, 3))
plot(df$gni, df$lifeexp,
main="Data used by Pearson's r",
col='darkblue',
pch = 1)
hist(df$gni)
hist(df$lifeexp)
df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(rank(df2$gni), rank(df2$lifeexp),
main="Data used by Spearman's r",
col='darkblue',
pch = 1)
hist(rank(df2$gni))
hist(rank(df2$lifeexp))
df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(log10(df2$gni), df2$lifeexp,
main="Data used by Pearson's r\nwith log(gni)",
col='darkblue',
pch = 1)
hist(log10(df2$gni))
hist(df2$lifeexp)
The linear nature achieved by logarithmic transformation is also reflected by Pearson’s correlation coefficients.
t = data.frame(Coefficient = c("Pearson", "Spearman", "Pearson", "Spearman"), Value = round(
c(
cor(df$gni, df$lifeexp, use = "complete.obs", method = "pearson"),
cor(df$gni, df$lifeexp, use = "complete.obs", method = "spearman"),
cor(log10(df$gni), df$lifeexp, use = "complete.obs", method = "pearson"),
cor(log10(df$gni), df$lifeexp, use = "complete.obs", method = "spearman")
), 3),
Variables = c("gni x lifeexp", "gni x lifeexp", "log(gni) x lifeexp", "log(gni) x lifeexp")
)
kable_styling(kable(t), full_width = FALSE)
| Coefficient | Value | Variables |
|---|---|---|
| Pearson | 0.583 | gni x lifeexp |
| Spearman | 0.847 | gni x lifeexp |
| Pearson | 0.800 | log(gni) x lifeexp |
| Spearman | 0.847 | log(gni) x lifeexp |
After transformation, both correlation coefficients reflect the clear and strong relationship. Note that \(r_s\) did not change in comparison to the linear model, as the ordinal information (the order of countries by gni and lifeexp, respectively) did not change by logarithm.
lm(df$lifeexp ~ log10(df$gni))
##
## Call:
## lm(formula = df$lifeexp ~ log10(df$gni))
##
## Coefficients:
## (Intercept) log10(df$gni)
## 33.58 10.12
plot(log10(df$gni), df$lifeexp)
abline(lm(df$lifeexp ~ log10(df$gni)))
The model estimate after transformation is
est. lifeexp = 33.6 + 10.1 * log10(gni)
An increase of log(gni) by 1 results in 10 additional life years, i.e. a tenfold increase of gni per capita gains 10 additional life years in the population.
In substantive terms, this implies a strong but diminishing-return relationship: proportional (tenfold) increases in income are associated with substantial gains in life expectancy, highlighting that improvements in economic capacity translate into meaningful, but not linear, health benefits. Note, however, that the scatterplot shows a slight ammount of heteroscedasticity with poorer countries varying more than richer countries.