The dataset used in this exercise is derived from the World Bank, one of the most widely used sources for internationally comparable development indicators. It combines demographic, economic, and social variables at the country level, allowing for cross-national analyses of global development patterns. Data refer to the year 2018.

Each observation represents a country and includes the following variables:

Together, these variables capture central dimensions of development, including economic capacity, health outcomes, demographic dynamics, and gender-related labor market participation. The dataset is particularly suitable for illustrating statistical relationships in real-world, non-ideal data, where issues such as non-linearity, heteroscedasticity, and unequal distributions are common.

1 Female workforce and population growth

Start with a theoretical assumption about an association (can be proven by statistics):

The higher the share of women in non agricultural workforce, the slower populations will grow.

This assumpution is based on the following line of argumentation: Economic development > improved family economics > female empowerment > increasing age at first birth > decreasing fertility rate > decreasing population growth.

Note that the underlying causal chain cannot be established using statistical analysis alone. However, we can provide empirical evidence to support our arguments.

Over the past few decades, there has been a notable increase in the participation of women in the non-agricultural workforce worldwide. This shift is primarily driven by factors such as changing social norms, improved education and employment opportunities for women, and the desire for economic empowerment.

There is often a positive correlation between the proportion of women in the non-agricultural workforce and a country’s level of economic development. As economies progress, they tend to undergo a structural transformation, shifting from primarily agricultural-based to more diversified sectors such as manufacturing, services, and technology. This transition provides increased opportunities for women to engage in non-agricultural employment.

As more women participate in the non-agricultural workforce, there is evidence to suggest a decline in fertility rates. Women who are engaged in paid employment often delay starting a family or choose to have fewer children due to the demands of their careers, the pursuit of education, or the desire for greater financial stability. Consequently, countries with higher female workforce participation rates generally experience lower population growth rates. The increased inclusion of women in the non-agricultural workforce is often linked to improved access to education and greater gender equality. When women have access to quality education and are empowered to participate in the labor market, they tend to make informed choices about family planning, leading to lower population growth.

It is essential to recognize that cultural norms, social structures, government policies, and economic factors can influence these dynamics differently. In some regions, where gender disparities are more pronounced or where traditional gender roles prevail, the relationship between female workforce participation and population growth may be less pronounced or exhibit different patterns.

We cannot prove these causalities by showing that the statistical association exists. At best we can strengthen our arguments. Vice versa, we can refute (at least weaken) our theoretical assumptions by showing that the statistical associations don’t exist.

Descriptive statistics

Worldwide we observe a moderately positive fairly symmetrically distributed population growth of countries

summary(df$popgrowth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
## -3.2300  0.4275  1.2050  1.2659  2.1225  5.8300       1
hist(df$popgrowth)

The share of women in non agricultural sector is somewhat left skewed: many countries with high, some countries with lower shares, and many countries are close or even above gender parity (50%)

summary(df$femwork)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    6.20   31.40   40.70   37.96   47.50   55.80      38
hist(df$femwork)

There is a negative association: the higher the share of women in non agricultural workforce, the lower is the population growth, In the following plot, each dot represents a single country.

plot(df$femwork, df$popgrowth)

Identify outliers: countries with extreme labour inmigration (upper left) or extreme labor force outflow (lower left).

df$outlier = (df$popgrowth > 5) | (df$popgrowth < 0 & df$femwork < 20)
t = as.data.frame(table(df[df$outlier, "country"]))
names(t) = c("Country", "Frequency")
kable_styling(kable(t), full_width = FALSE)
Country Frequency
Oman 1
Syrian Arab Republic 1
plot(df$femwork, df$popgrowth, col=ifelse(df$outlier, 'orange', 'darkblue'), 
     pch = ifelse(df$outlier, 16, 1), 
     main = "How population growth is associated with female empowerment", 
     xlab = "Share of female workforce in non agricultural sector (%)", 
     ylab = "Population growth (%)")

Correlation Analysis

Pearson’s correlation coefficient \(r\) measures the strength of a linear association, ranging from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no association. Because missing values are present in the dataset, calculations must be restricted to complete observations.

# does not work due to NAs
cor(df$popgrowth, df$femwork)
## [1] NA
# tell R to restrict calculation to complete observations only
cor(df$popgrowth, df$femwork, use = "complete.obs")
## [1] -0.5054793

The observed \(r\) = -0.505 indicates a strong negative correlation, meaning that approximately 26% of the shared variation between population growth and female workforce is accounted for.

While Pearson’s \(r\) quantifies the strength of a relationship using the original metric values, Spearman’s \(r_s\) is computed from the ranks of the variables. Plotting the ranks shows that they are uniformly distributed, as each rank occurs exactly once. The scatterplot is ‘stretched’ and ‘smoothed’ in both directions, and clusters of values are resolved.

par(mfrow = c(2, 3))
plot(df$femwork, df$popgrowth, 
     main="Data used by Pearson's r",
     col=ifelse(df$outlier, 'orange', 'darkblue'), 
     pch = ifelse(df$outlier, 16, 1))
hist(df$femwork)
hist(df$popgrowth)
df2 = df[!is.na(rowSums(df[, c("popgrowth", "femwork")])), ]
plot(rank(df2$femwork), rank(df2$popgrowth), 
     main="Data used by Spearman's r",
     col=ifelse(df2$outlier, 'orange', 'darkblue'), 
     pch = ifelse(df2$outlier, 16, 1))
hist(rank(df$femwork))
hist(rank(df$popgrowth))

Comparing both concepts yields similar results for this specific relationship:

t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
  c(
    cor(df$popgrowth, df$femwork, use = "complete.obs", method = "pearson"),
    cor(df$popgrowth, df$femwork, use = "complete.obs", method = "spearman")
    ), 3)
)
kable_styling(kable(t), full_width = FALSE)
Coefficient Value
Pearson -0.505
Spearman -0.529

Linear Regression Analysis

To describe the shape of the association, a linear regression line is estimated to capture the average relationship. In this analysis, female workforce is treated as the explanatory variable driving population growth (the dependent variable). Estimating the linear model yields a regression constant (intercept) and a regression coefficient.

lm(df$popgrowth ~ df$femwork)
## 
## Call:
## lm(formula = df$popgrowth ~ df$femwork)
## 
## Coefficients:
## (Intercept)   df$femwork  
##     3.00481     -0.04764

The estimated regression line is:

est. popgrowth = est. popgrowth = 3.005 - 0.048*femwork .

The intercept suggests that with 0% female non-agricultural workforce, the estimated population growth is 3.0%. The regression coefficient implies that for every 10 percentage point increase in female workforce, population growth decreases by about 0.5 percentage points.

This relationship does not change significantly if outliers are excluded:

par(mfrow = c(1, 2))
ymin = min(df[df$outlier, "popgrowth"], na.rm=T)
ymax = max(df[df$outlier, "popgrowth"], na.rm=T)
title = paste("est. popgrowth =", "\n", mtxt1)
plot(df$femwork, df$popgrowth, ylim=c(ymin, ymax),
     main=title,
     col=ifelse(df$outlier, 'orange', 'darkblue'), 
     pch = ifelse(df$outlier, 16, 1))
abline(lm(df$popgrowth ~ df$femwork))

df2 = df[!df$outlier,]
title = paste("est. popgrowth =", "\n", mtxt2)
plot(df2$femwork, df2$popgrowth, 
     main=title,
     ylim=c(ymin, ymax), col='darkblue')
abline(lm(df2$popgrowth ~ df2$femwork))

Regression models are used in two main contexts:

  • Hypothesis testing: They allow us to empirically assess theoretically derived relationships between variables by interpreting regression coefficients as conditional effects and evaluating them using statistical inference. In the example above, the regression coefficient, i.e., the slope of the regression line, provides evidence for the existence of a negative effect of female workforce on population growth.
  • Prediction: They are used to forecast outcomes based on observed predictors; here, the focus is less on causal interpretation and more on predictive accuracy. With regard to the example we mioght ask for the expected population growth in pcouintries with a female workforce participation as high as 60%. To answer this question, we simply insert the \(x\)-value into our equation:

On average, population growth is estimated

est. popgrowth = 3.04 - 0.049*femwork = 3.04 - 0.049**60 = 0.1%

for countries with a 60% share of women in non agricultural workforce.

xmin = min(df[df$outlier, "femwork"], na.rm=T)
ymin = min(df[df$outlier, "popgrowth"], na.rm=T)
ymax = max(df[df$outlier, "popgrowth"], na.rm=T)
title = "Population growth estimate for countries with a 60% share of women in non agricultural workforce."

df2 = df[!df$outlier,]
m=lm(df2$popgrowth ~ df2$femwork)
title = paste("est. popgrowth =", "\n", mtxt2)
plot(df2$femwork, df2$popgrowth, 
     main=title,
     ylim=c(ymin, ymax), xlim=c(xmin, 62),
     col='darkblue')
abline(m)
b = coefficients(m)
yest = b[1]+b[2]*60
# lx = c(60, 60, xmin)
# ly = c(ymin, yest, yest)
#lines(lx,ly)
abline(v=60, lty=3, col='orange')
abline(h=yest, lty=3, col='orange')
points(60, yest, bg='orange', col='white', pch=21, cex=1)

In both cases, model fit (e.g., explained variance, prediction error) and the validity of underlying assumptions are crucial.

2 Female workforce and child mortality

Exploring the relationship between female participation in the non-agricultural workforce and child mortality is central to understanding broader development dynamics. Both indicators capture key dimensions of societal progress: child mortality reflects health conditions and access to basic services, while female labor force participation indicates economic inclusion and gender equality. Their association is theoretically meaningful, as improvements in child health can reduce caregiving burdens and enable greater economic participation of women, while increased female employment may, in turn, contribute to better health outcomes through higher household income and improved resource allocation. Analyzing this relationship therefore provides insight into how health and economic structures interact in the process of global development.

Accordingly, the hypothesis analysed in this section claims:

Countries with higher child mortality tend to exhibit lower female participation in the non-agricultural workforce.

Descriptive statistics

We start again by describing our variables and visually exploring their relationships.

Child mortality is extremely right-skewed (L-shaped): most countries have low rates, while some countries have extremely high rates.

summary(df$childmort)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    2.10    7.70   16.60   30.20   46.75  132.50      30
hist(df$childmort)

As seen above, the share of women in the non-agricultural sector is somewhat left-skewed: many countries have high shares, while some countries have lower shares. Many countries are close to or even above gender parity (50%).

summary(df$femwork)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    6.20   31.40   40.70   37.96   47.50   55.80      38
hist(df$femwork)

A scatterplot reveals a negative association.

plot(df$childmort, df$femwork)

Linear regression analysis

Regression analysis estimates female workforce participation as dependent of child mortality:

est. female workforce = 42.562 - 0.196*childmort

The reduction of child mortality by one child out of 1,000 implies an additional share of female non-agricultural workforce by 0.2 percentage points on average.

But, given the above scatterplot, how reliable is this basic interpretation? The scatterplot shows that the variation in child mortality is much larger among countries with low female workforce participation than among those with high participation. This pattern indicates heteroscedasticity, i.e. unequal variance of the outcome across the range of the predictor. As a result, simple linear models have limited reliability for inference and are particularly weak for prediction, since the uncertainty is not constant across observations. Substantively, this pattern also points to the presence of omitted variables. In this case, regional or cultural context may play a role: for example, countries in North Africa and the Middle East tend to have low female workforce participation but relatively low child mortality, whereas many countries in Sub-Saharan Africa combine low participation with high child mortality. This suggests that the observed association is at least partly driven by underlying structural differences between groups of countries.

cols = c("darkorange", "darkred", "lightsteelblue")

plot(df$childmort, df$femwork,
     pch = 19,
     col = cols[df$region2])

abline(lm(df$femwork ~ df$childmort))

legend("bottomright",
       legend = levels(df$region2),
       pch = 19,
       col = cols)

Correlation analysis

Looking at the differences between Pearson’s \(r\) and Spearmans’s \(r_s\) reveals, that \(r_s\) is much greater than \(r\). Obviously, Pearson’s \(r\) does not cope well with with ‘less then perfect’ associations or unequal variances.

par(mfrow = c(2, 3))
plot(df$childmort, df$femwork, 
     main="Data used by Pearson's r",
     col='darkblue', 
     pch = 1)
hist(df$childmort)
hist(df$femwork)
df2 = df[!is.na(rowSums(df[, c("childmort", "femwork")])), ]

plot(rank(df2$childmort), rank(df2$femwork), 
     main="Data used by Spearman's r",
     col='darkblue', 
     pch = 1)
hist(rank(df2$childmort))
hist(rank(df2$femwork))

t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
  c(
    cor(df$childmort, df$femwork, use = "complete.obs", method = "pearson"),
    cor(df$childmort, df$femwork, use = "complete.obs", method = "spearman")
    ), 3)
)
kable_styling(kable(t), full_width = FALSE)
Coefficient Value
Pearson -0.442
Spearman -0.585

3 Economic wealth and life expectancy

A central question in global development is how economic resources translate into population health. Gross national income per capita (gni) is commonly used as a proxy for a country’s economic capacity, while life expectancy (lifeexp) captures overall health outcomes. Examining the relationship between these variables helps to understand whether—and to what extent—greater economic power is associated with longer lives.

Which impact does the economic power of countries (gni) have on the life expectancy of their populations (lifeexp)? This is one of the most prominent questions in global development research in recent decades. Our hypothesis for this section is

A country’s economic power is positively associated with the life expectancy of its population.

Descriptive Statistics

summary(df$lifeexp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##   48.93   65.91   73.51   71.43   77.47   83.98      18
hist(df$lifeexp)

summary(df$gni)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.      NAs 
##    271.2   1558.0   5562.2  16456.1  20305.4 172676.3       22
hist(df$gni)

plot(df$gni, df$lifeexp)

The distribution of lifeexp is moderately left skewed. While a majority of populations show life expectancies of 70 years and above, some countries exist with life expectancies below 60 (and even below 50) years. In contrast, gni shows a L-shaped distribution indicating that more than three quarters of the countries world wide have gni values below 20.000 US$ while the last quarter covers a range from 20.000 to over 170.000 US$ per capita.

Correlation analysis

The scatterplot reveals a clear and strong, yet non-linear relationship: life expectancy increases steeply at lower levels of economic development, particularly among low-income countries. However, this increase gradually attenuates as economic prosperity rises. This pattern reflects a classic plateau effect, where marginal gains in life expectancy become progressively smaller at higher income levels, indicating diminishing returns to additional economic resources in already affluent contexts. ions.

t = data.frame(Coefficient = c("Pearson", "Spearman"), Value = round(
  c(
    cor(df$gni, df$lifeexp, use = "complete.obs", method = "pearson"),
    cor(df$gni, df$lifeexp, use = "complete.obs", method = "spearman")
    ), 3)
)
kable_styling(kable(t), full_width = FALSE)
Coefficient Value
Pearson 0.583
Spearman 0.847

Interpretation: While Pearson’s \(r\) implies that 0.5832 \(\approx\) 34% of the variance is shared, Spearman’s \(r_s\) implies that 0.8472 \(\approx\) 72% of the variance in the ranks is shared.

The rank coefficient \(r_s\) is substantially larger than Pearson’s \(r\). Pearson’s \(r\) captures only the linear component of the association and treats any non-linear structure as unexplained variance; accordingly, it does not adequately represent non-linear relationships. In contrast, Spearman’s \(r_s\) is sensitive to monotonicity, capturing consistent increasing or decreasing trends even when the relationship is not linear. As shown above, \(r_s\) ‘stretches’ the association by operating on ranks. In other words, using ranked variables effectively linearizes a strong monotonic, but non-linear, relationship.

par(mfrow = c(2, 3))
plot(df$gni, df$lifeexp, 
     main="Data used by Pearson's r",
     col='darkblue', 
     pch = 1)
hist(df$gni)
hist(df$lifeexp)

df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(rank(df2$gni), rank(df2$lifeexp), 
     main="Data used by Spearman's r",
     col='darkblue', 
     pch = 1)
hist(rank(df2$gni))
hist(rank(df2$lifeexp))

Linear regression analysis

It is evident that the association cannot be adequately modeled by a straight line:

lm(df$lifeexp ~ df$gni)
## 
## Call:
## lm(formula = df$lifeexp ~ df$gni)
## 
## Coefficients:
## (Intercept)       df$gni  
##   6.821e+01    2.152e-04
plot(df$gni, df$lifeexp)
abline(lm(df$lifeexp ~ df$gni))

The model output shows that an increase of gni by 10.000 US$ gains 2.1 additional life years on average. But this description does not reflect the non-liner shape. Alternatively, we could express the relationship in terms of ranks:

lm(rank(df$lifeexp) ~ rank(df$gni))
## 
## Call:
## lm(formula = rank(df$lifeexp) ~ rank(df$gni))
## 
## Coefficients:
##  (Intercept)  rank(df$gni)  
##      31.3550        0.7123
plot(rank(df$gni), rank(df$lifeexp))
abline(lm(rank(df$lifeexp) ~ rank(df$gni)))

According to this model, the rank of lifeexp increases on average by 0.7 if a country succeeds in moving up one rank in gni. While this is statistically valid, it does not help us understand the functional form of the relationship. As with rank transformations, we therefore seek a way to linearize the association so that we can meaningfully quantify its slope by estimating an appropriate regression coefficient.

Non-linear regression analysis (Transformation)

The linear model does not reflect the non-linear structure of the data. In order to deal with non-linear associations, we have two options:

  • choose a non-linear model function
  • transform our data to linearize the association

As we prefer to continue working with linear regression models, we choose the second option. A closer look at the non-linear association reveals the non-linear nature of he association in more detail.

### NON-LINEAR REGRESSION

plot(df$gni, df$lifeexp)

abline(v=1000, lty=3, col='orange')
abline(v=10000, lty=3, col='orange')
abline(v=100000, lty=3, col='orange')
abline(h=64, lty=3, col='orange')
abline(h=74, lty=3, col='orange')
abline(h=84, lty=3, col='orange')
points(1000, 64,bg='orange', col='white', pch=21, cex=1.5)
points(10000, 74, bg='orange', col='white', pch=21, cex=1.5)
points(100000, 84, bg='orange', col='white', pch=21, cex=1.5)

It becomes evident that while lifeexp grows linearly (by adding 10), gni grows geometrically (by multiplying with 10).

# x,y = coordinates; cex = size of points; pch = shape of points; col = border colour; bg = background colour
# we describe our association by these three points:

t = data.frame(gni = c("1.000", "*10", "10.000", "*10", "100.000"),
           lifeexp = c(64, "+10", 74, "+10", 84))
kable_styling(kable(t, align = "rr"), full_width = FALSE)
gni lifeexp
1.000 64
*10 +10
10.000 74
*10 +10
100.000 84
#      gni | MULTIPLY gni by | lifeexp | ADD to lifeexp 
# ---------|-----------------|---------|---------------
#     1000 |                 |      64 |            
#          |            * 10 |         |           + 10
#    10000 |                 |      74 |            
#          |            * 10 |         |           + 10
#   100000 |                 |      84 |            

# The non-linearity of the association is due to the fact, that lifeexp 
# grows arithmetically (+) while gni grows geometrically (*). We can transform 
# the change of gni into linear growth by looking at the power of 10:

That said, we can transform the change of gni into linear growth by looking at the power of 10:

t = data.frame(gni = c("1.000 = 10^3^", "*10", "10.000 = 10^4^", "*10", "100.000 = 10^5^"),
               exponent = c("3", "+1", "4", "+1", "5"),
           lifeexp = c(64, "+10", 74, "+10", 84))
names(t)[2] = "exponent of gni"
kable_styling(kable(t, align = "rr"), full_width = FALSE)
gni exponent of gni lifeexp
1.000 = 103 3 64
*10 +1 +10
10.000 = 104 4 74
*10 +1 +10
100.000 = 105 5 84
#            gni |  exponent | ADD to exponent | lifeexp | ADD to lifeexp 
#    1000 = 10^3 |         3 |                 |      64 |            
#                |           |             + 1 |         |           + 10
#   10000 = 10^4 |         4 |                 |      74 |            
#                |           |             + 1 |         |           + 10
#  100000 = 10^5 |         5 |                 |      84 |

In general: if we include the exponents of gni to base 10 into our model instead of gni itself we have an approximately linear relationship. To achieve this, we calculate the common logarithms of gni. The x-axis now shows the expoents (3, 4, 5) instead of untransformed values (1.000, 10.000, 100.000), and the association is approximately linear.

# association with common logarithm (base 10) is close to linear now
plot(log10(df$gni), df$lifeexp)

abline(v=3, lty=3, col='orange')
abline(v=4, lty=3, col='orange')
abline(v=5, lty=3, col='orange')
abline(h=64, lty=3, col='orange')
abline(h=74, lty=3, col='orange')
abline(h=84, lty=3, col='orange')
points(3, 64,bg='orange', col='white', pch=21, cex=1.5)
points(4, 74, bg='orange', col='white', pch=21, cex=1.5)
points(5, 84, bg='orange', col='white', pch=21, cex=1.5)

par(mfrow = c(3, 3))
plot(df$gni, df$lifeexp, 
     main="Data used by Pearson's r",
     col='darkblue', 
     pch = 1)
hist(df$gni)
hist(df$lifeexp)

df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(rank(df2$gni), rank(df2$lifeexp), 
     main="Data used by Spearman's r",
     col='darkblue', 
     pch = 1)
hist(rank(df2$gni))
hist(rank(df2$lifeexp))

df2 = df[!is.na(rowSums(df[, c("gni", "lifeexp")])), ]
plot(log10(df2$gni), df2$lifeexp, 
     main="Data used by Pearson's r\nwith log(gni)",
     col='darkblue', 
     pch = 1)
hist(log10(df2$gni))
hist(df2$lifeexp)

The linear nature achieved by logarithmic transformation is also reflected by Pearson’s correlation coefficients.

t = data.frame(Coefficient = c("Pearson", "Spearman", "Pearson", "Spearman"), Value = round(
  c(
    cor(df$gni, df$lifeexp, use = "complete.obs", method = "pearson"),
    cor(df$gni, df$lifeexp, use = "complete.obs", method = "spearman"),
    cor(log10(df$gni), df$lifeexp, use = "complete.obs", method = "pearson"),
    cor(log10(df$gni), df$lifeexp, use = "complete.obs", method = "spearman")
    ), 3),
  Variables = c("gni x lifeexp", "gni x lifeexp", "log(gni) x lifeexp", "log(gni) x lifeexp")
)
kable_styling(kable(t), full_width = FALSE)
Coefficient Value Variables
Pearson 0.583 gni x lifeexp
Spearman 0.847 gni x lifeexp
Pearson 0.800 log(gni) x lifeexp
Spearman 0.847 log(gni) x lifeexp

After transformation, both correlation coefficients reflect the clear and strong relationship. Note that \(r_s\) did not change in comparison to the linear model, as the ordinal information (the order of countries by gni and lifeexp, respectively) did not change by logarithm.

lm(df$lifeexp ~ log10(df$gni))
## 
## Call:
## lm(formula = df$lifeexp ~ log10(df$gni))
## 
## Coefficients:
##   (Intercept)  log10(df$gni)  
##         33.58          10.12
plot(log10(df$gni), df$lifeexp)
abline(lm(df$lifeexp ~ log10(df$gni)))

The model estimate after transformation is

est. lifeexp = 33.6 + 10.1 * log10(gni)

An increase of log(gni) by 1 results in 10 additional life years, i.e. a tenfold increase of gni per capita gains 10 additional life years in the population.

In substantive terms, this implies a strong but diminishing-return relationship: proportional (tenfold) increases in income are associated with substantial gains in life expectancy, highlighting that improvements in economic capacity translate into meaningful, but not linear, health benefits. Note, however, that the scatterplot shows a slight ammount of heteroscedasticity with poorer countries varying more than richer countries.