Our group focused on Germany, Belgium, Ireland, and France because these four Western European nations represent contrasting trajectories in how income relates to health and demographic outcomes over 200 years (1825-2025). Germany exemplifies early industrialization with pioneering social welfare yet faced war disruptions. Belgium shows steady industrial development. Ireland uniquely transformed from extreme poverty and the Great Famine to rapid “Celtic Tiger” modernization. France demonstrates early demographic transition with progressive public health reforms.
By comparing these countries, we can test whether the global income-health relationship holds consistently across different European development models, and whether Western Europe’s demographic exceptionalism (low fertility and high life expectancy) varies significantly between nations with different income trajectories. These contrasting development paths allow us to examine how national context shapes the lives of the people living in each country.
From 1825 to 2025:
The fertility rate of a European country is smaller than the world’s average.
A European country has a longer life expectancy than the world’s average.
There is a relationship between income level and life expectancy in a European country.
There is a relationship between income level and fertility rate in a European country.
We began by importing and reading the Excel file named world_data.xlsx. Then, we filtered the data to the years 1825 to 2025 and stored the result as filtered_data.
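The filtering step can be sketched in base R as follows. This is a minimal illustration, not the report's exact code: `toy_world_data` is a hypothetical stand-in for the imported spreadsheet with made-up values, and only the column names `country`, `year`, and `babies_per_woman` are taken from the code used later in the report.

```r
# Hypothetical stand-in for the imported spreadsheet (values are made up).
toy_world_data <- data.frame(
  country = c("Germany", "Germany", "Germany", "France"),
  year = c(1800, 1825, 2025, 2030),
  babies_per_woman = c(5.4, 5.1, 1.5, 1.8)
)

# Keep only the study period, 1825 to 2025 inclusive.
filtered_data <- toy_world_data[
  toy_world_data$year >= 1825 & toy_world_data$year <= 2025,
]

print(filtered_data)
```

Both boundary years are kept because the comparison operators are inclusive.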
a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.
Interesting Hypothesis: The fertility rate of a European country is smaller than the world’s average.
Motivation: Most of Europe experienced a substantial decline in fertility over the past two centuries and had lower fertility rates than the rest of the world, which has been attributed in part to abstinence from marriage. The decline began in France before the end of the eighteenth century, and the differences between Eastern and Western European trends are smaller than expected.
Ref: Coale, A.J., Garaud, M., Szramkiewicz, R., & Burguière, A. (1991). The decline of fertility in Europe from the French Revolution to World War II. Annales. Histoire, Sciences Sociales, 46.
Ref: van de Kaa, D.J. (1987). Europe’s second demographic transition. Population Bulletin, 42(1), 1-59.
b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?
Germany
Green box: Displays babies per woman in Germany over the 200-year period. The values range from just over 1 to almost 5 babies per woman, with a median of around 2.5.
Blue box: Displays the world average of babies per woman over the 200-year period. The values range from almost 5 to almost 6 babies per woman, with a median of around 5.6.
Belgium
* Descriptions:
1. Pink box: Displays babies per woman in Belgium over the 200-year period. The values range from just over 1.7 to almost 4.5 babies per woman, with a median of around 2.7.
2. Blue box: Displays the world average of babies per woman over the 200-year period. The values range from almost 5 to almost 6 babies per woman, with a median of around 5.6.
France
1. Yellow box: Displays babies per woman in France over the 200-year period. The values range from 2 to almost 3.5 babies per woman, with a median of around 2.8.
2. Blue box: Displays the world average of babies per woman over the 200-year period. The values range from almost 5 to almost 6 babies per woman, with a median of around 5.6.
Ireland
1. Purple box: Displays babies per woman in Ireland over the 200-year period. The values range from just over 2.5 to almost 3.6 babies per woman, with a median of around 3.2.
2. Blue box: Displays the world average of babies per woman over the 200-year period. The values range from almost 5 to almost 6 babies per woman, with a median of around 5.6.
4 grouped countries
1. Red box: Displays babies per woman for the group of Ireland, Germany, France and Belgium over the 200-year period. The values range from just over 1.7 to 4.5 babies per woman, with a median of around 2.6.
2. Blue box: Displays the world average of babies per woman over the 200-year period. The values range from almost 5 to almost 6 babies per woman, with a median of around 5.6.
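The range and median values quoted in these box plot descriptions can be computed directly rather than read off the plots. The sketch below shows one way to do so in base R. The data frame `toy_fertility` and its values are hypothetical and only illustrate the calculation.

```r
# Hypothetical fertility series (babies per woman); values are made up.
toy_fertility <- data.frame(
  series = rep(c("Country", "World"), each = 5),
  babies_per_woman = c(4.8, 3.9, 2.5, 1.6, 1.4, 5.9, 5.7, 5.6, 5.2, 4.9)
)

# Minimum, median, and maximum per series: the values summarised by a box plot.
box_summary <- sapply(
  split(toy_fertility$babies_per_woman, toy_fertility$series),
  function(x) c(min = min(x), median = median(x), max = max(x))
)

print(box_summary)
```

If the country's maximum lies below the world's minimum, as in this toy example, the two boxes do not overlap at all, which is already strong descriptive evidence of a lower fertility rate.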
c. What are null, alternative hypothesis and significance levels?
Null Hypothesis: The chosen country has an equal fertility rate to the world’s average rate.
Alternative Hypothesis: The chosen country has a lower fertility rate than the world’s average rate.
Significance level: 0.05
d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?
What is the p-value? Note that the p-value is the probability that a test statistic generated under the null hypothesis is at least as extreme as the observed value. What is the statistical conclusion for the hypothesis test?
# Merge Germany's fertility data (babies per woman) with the world annual average by year
germany_vs_world <- merge(
germany_babies_by_year,
world_year_avg_babies,
by = "year"
)
# Calculate the difference
germany_vs_world$difference <- germany_vs_world$babies_per_woman - germany_vs_world$world_year_avg_babies
# Variable for the differences
germany_vs_world_differences <- germany_vs_world$difference
# Select only year and difference columns
germany_difference_table <- germany_vs_world[, c("year", "difference")]
# Display the resulting table
germany_difference_table
## year difference
## 1 1825 -0.7422335
## 2 1826 -0.8275127
## 3 1827 -0.9209137
## 4 1828 -1.0229442
## 5 1829 -1.1105076
## 6 1830 -1.2054822
## 7 1831 -1.1802538
## 8 1832 -1.1485279
## 9 1833 -1.1320812
## 10 1834 -1.1068020
## 11 1835 -1.0800000
## 12 1836 -1.0757360
## 13 1837 -1.0739594
## 14 1838 -1.0712183
## 15 1839 -1.0713706
## 16 1840 -1.0742640
## 17 1841 -1.0515228
## 18 1842 -1.0249746
## 19 1843 -1.0021320
## 20 1844 -0.9707614
## 21 1845 -0.9474112
## 22 1846 -0.9452284
## 23 1847 -0.9414721
## 24 1848 -0.9520305
## 25 1849 -0.9595431
## 26 1850 -0.9574619
## 27 1851 -1.0943655
## 28 1852 -1.2131472
## 29 1853 -1.3460914
## 30 1854 -1.4737056
## 31 1855 -1.6022843
## 32 1856 -1.4843147
## 33 1857 -1.3750254
## 34 1858 -1.2673096
## 35 1859 -1.1526904
## 36 1860 -1.0404061
## 37 1861 -1.0055330
## 38 1862 -0.9770051
## 39 1863 -0.9386294
## 40 1864 -0.9139086
## 41 1865 -0.8727919
## 42 1866 -0.8382234
## 43 1867 -0.8072589
## 44 1868 -0.7808122
## 45 1869 -0.7574112
## 46 1870 -0.7417259
## 47 1871 -0.6747208
## 48 1872 -0.6120812
## 49 1873 -0.5676650
## 50 1874 -0.5106599
## 51 1875 -0.4606091
## 52 1876 -0.5414213
## 53 1877 -0.6068020
## 54 1878 -0.6692893
## 55 1879 -0.7567005
## 56 1880 -0.8275635
## 57 1881 -0.7498985
## 58 1882 -0.6773604
## 59 1883 -0.6004569
## 60 1884 -0.6115736
## 61 1885 -0.6118782
## 62 1886 -0.6186294
## 63 1887 -0.6453807
## 64 1888 -0.6497970
## 65 1889 -0.6457868
## 66 1890 -0.6374619
## 67 1891 -0.6703046
## 68 1892 -0.6609645
## 69 1893 -0.6771574
## 70 1894 -0.6984264
## 71 1895 -0.7317766
## 72 1896 -0.7434518
## 73 1897 -0.7693909
## 74 1898 -0.7755838
## 75 1899 -0.8220812
## 76 1900 -0.8887310
## 77 1901 -0.9373096
## 78 1902 -0.9912690
## 79 1903 -1.0283756
## 80 1904 -1.1041624
## 81 1905 -1.1657360
## 82 1906 -1.2550761
## 83 1907 -1.3294416
## 84 1908 -1.4185787
## 85 1909 -1.5631472
## 86 1910 -1.7219797
## 87 1911 -1.8783756
## 88 1912 -2.0298985
## 89 1913 -2.1706599
## 90 1914 -2.4004569
## 91 1915 -2.5868528
## 92 1916 -2.8273096
## 93 1917 -3.0562437
## 94 1918 -3.2979695
## 95 1919 -3.2057868
## 96 1920 -3.2087817
## 97 1921 -3.1195431
## 98 1922 -3.0416244
## 99 1923 -2.9620812
## 100 1924 -3.1463959
## 101 1925 -3.3467513
## 102 1926 -3.4472589
## 103 1927 -3.5407614
## 104 1928 -3.5180711
## 105 1929 -3.5489848
## 106 1930 -3.5903046
## 107 1931 -3.7269543
## 108 1932 -3.7825888
## 109 1933 -3.7899492
## 110 1934 -3.4272081
## 111 1935 -3.3015228
## 112 1936 -3.2528426
## 113 1937 -3.2254315
## 114 1938 -3.0604569
## 115 1939 -2.9008629
## 116 1940 -2.8847716
## 117 1941 -3.0267513
## 118 1942 -3.4261421
## 119 1943 -3.2600000
## 120 1944 -3.3682234
## 121 1945 -3.7155838
## 122 1946 -3.2996447
## 123 1947 -3.2869543
## 124 1948 -3.2112690
## 125 1949 -3.1627411
## 126 1950 -3.2363959
## 127 1951 -3.2410152
## 128 1952 -3.2737056
## 129 1953 -3.2730457
## 130 1954 -3.2595939
## 131 1955 -3.2506091
## 132 1956 -3.2187817
## 133 1957 -3.1845178
## 134 1958 -3.1782234
## 135 1959 -3.0989340
## 136 1960 -3.0731980
## 137 1961 -2.9929949
## 138 1962 -2.9993401
## 139 1963 -2.9385787
## 140 1964 -2.9086294
## 141 1965 -2.8960914
## 142 1966 -2.8427919
## 143 1967 -2.8625381
## 144 1968 -2.9008629
## 145 1969 -2.9850254
## 146 1970 -3.1101015
## 147 1971 -3.1343147
## 148 1972 -3.2909645
## 149 1973 -3.3922843
## 150 1974 -3.3595431
## 151 1975 -3.3258883
## 152 1976 -3.2504569
## 153 1977 -3.1700508
## 154 1978 -3.1226904
## 155 1979 -3.0821320
## 156 1980 -2.9677157
## 157 1981 -2.9477157
## 158 1982 -2.9318782
## 159 1983 -2.9496447
## 160 1984 -2.9330457
## 161 1985 -2.8850761
## 162 1986 -2.7875127
## 163 1987 -2.7077665
## 164 1988 -2.6294924
## 165 1989 -2.5928426
## 166 1990 -2.5046193
## 167 1991 -2.5508629
## 168 1992 -2.5078680
## 169 1993 -2.4450761
## 170 1994 -2.4069543
## 171 1995 -2.3185279
## 172 1996 -2.1725888
## 173 1997 -2.0545685
## 174 1998 -2.0071066
## 175 1999 -1.9438071
## 176 2000 -1.8759391
## 177 2001 -1.8572589
## 178 2002 -1.8156345
## 179 2003 -1.7748731
## 180 2004 -1.7226904
## 181 2005 -1.7106091
## 182 2006 -1.6882741
## 183 2007 -1.6321320
## 184 2008 -1.6144670
## 185 2009 -1.6115736
## 186 2010 -1.5492893
## 187 2011 -1.5271066
## 188 2012 -1.4826396
## 189 2013 -1.4337563
## 190 2014 -1.3577157
## 191 2015 -1.2917766
## 192 2016 -1.1621827
## 193 2017 -1.1295939
## 194 2018 -1.0835533
## 195 2019 -1.0661929
## 196 2020 -1.0298985
## 197 2021 -0.9469036
## 198 2022 -1.0198985
## 199 2023 -1.0062437
## 200 2024 -0.9728934
## 201 2025 -0.9348223
→ Sub_Null_hypothesis: The mean of the annual differences (the chosen country’s fertility rate minus the world’s average) is equal to 0.
→ Sub_Alternative_Hypothesis: The mean of the annual differences is less than 0.
!!IF:
Sub_Null_hypothesis is true, then Null Hypothesis is true.
Sub_Alternative_Hypothesis is true, then Alternative Hypothesis is true.
* Test statistic: sample mean of the annual differences, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
where:
$x_i$: annual difference between the chosen country’s fertility rate and the world’s average in year $i$
$n$: number of years (201 for 1825-2025 inclusive, matching the 201 rows of the table above)
$\bar{x}$: mean of the annual differences
Germany
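The report plots `null_distribution` without showing how it was generated. One common way to build a null distribution for a mean is to shift the sample so that its mean equals the null value (here 0) and then resample with replacement. As far as we understand, this mirrors what a bootstrap under a point null does in the infer package, but that is an assumption about the original code. The base R sketch below uses the hypothetical vector `toy_differences` with made-up values and an assumed 1000 replicates.

```r
set.seed(1)

# Hypothetical annual differences (country minus world); values are made up.
toy_differences <- c(-0.7, -1.1, -0.9, -2.4, -3.3, -3.1, -2.5, -1.6, -1.0, -0.9)

# Observed test statistic: the sample mean of the differences.
observed_mean <- mean(toy_differences)

# Impose the null (mean difference = 0) by recentring the sample at 0.
centred <- toy_differences - observed_mean

# Resample with replacement and record the mean of each bootstrap sample.
reps <- 1000
null_means <- replicate(reps, mean(sample(centred, replace = TRUE)))

# The null means should scatter around 0, far from the observed mean.
print(observed_mean)
print(summary(null_means))
```

Recentring is what makes this a null distribution: without it, the resampled means would cluster around the observed mean instead of around 0.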
# Plot a histogram of the null distribution of mean differences
library(ggplot2)
ggplot(null_distribution, aes(x = stat)) +
geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
labs(
title = "Null Distribution of Mean of Differences (Germany vs World)",
x = "Mean Difference",
y = "Frequency"
) +
theme_minimal()
# Calculate the observed statistic (mean difference) from germany_difference_table
observed_stat <- germany_difference_table %>%
specify(response = difference) %>%
calculate(stat = "mean")
observed_stat
## Response: difference (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -1.92
# Visualize the null distribution and shade the p-value region for the observed statistic (direction = "less")
null_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_stat, direction = "less")
# Calculate the p-value for the observed statistic (direction = "less")
p_value <- null_distribution %>%
get_p_value(obs_stat = observed_stat, direction = "less")
p_value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value$p_value < alpha) {
conclusion <- "Accept the alternative hypothesis: The mean of annual differences in babies per woman between Germany and the world is less than 0. It means that Germany has lower fertility rates than the rest of world"
} else {
conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the mean of annual differences in babies per woman between Germany and the world is less than 0."
}
# Display the conclusion
conclusion
## [1] "Accept the alternative hypothesis: The mean of annual differences in babies per woman between Germany and the world is less than 0. It means that Germany has lower fertility rates than the rest of world"
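A simulation-based p-value of exactly 0 means that none of the simulated replicates was as extreme as the observed statistic. It does not mean that the probability is literally zero. One way to report it is as an upper bound set by the number of replicates. In the sketch below, `null_stats` is a hypothetical stand-in for the `stat` column of `null_distribution`, and its spread is made up. The observed value of -1.92 is the one reported above for Germany.

```r
set.seed(2)

# Hypothetical null statistics standing in for null_distribution$stat.
reps <- 1000
null_stats <- rnorm(reps, mean = 0, sd = 0.07)

# Observed mean difference reported above for Germany.
observed_stat <- -1.92

# Raw left-tailed p-value: share of replicates at or below the observed value.
raw_p <- mean(null_stats <= observed_stat)

# With a finite number of replicates, the smallest resolvable p-value is 1 / reps.
reported_p <- if (raw_p == 0) paste0("< ", 1 / reps) else format(raw_p)

print(reported_p)
```

Either way the conclusion at the 0.05 level is unchanged. Only the wording of the reported p-value differs.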
We then do the same across other countries.
Belgium
## [1] "Accept the alternative hypothesis: Belgium has lower fertility rates than the rest of world."
France
## [1] "Accept the alternative hypothesis: France has lower fertility rates than the rest of world."
Ireland
## [1] "Accept the alternative hypothesis: Ireland has lower fertility rates than the rest of world."
GROUP OF FOUR COUNTRIES:
## [1] "Accept the alternative hypothesis: The group of four countries has lower fertility rates than the rest of world."
a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.
Interesting Hypothesis: A European country has a longer life expectancy than the world’s average.
Research Paper: Wilmoth (2011) notes that by 1900, industrialized countries had life expectancies of 40-50 years, rising to around 80 years in the healthiest countries by 2000, demonstrating a dramatic improvement in longevity concentrated in developed nations.
Ref: Wilmoth, J.R. (2011). Increase of Human Longevity: Past, Present, and Future.
Research Paper: Zijdeman et al. (2014) suggest that life expectancy initially diverged in the late 19th and early 20th centuries, with Western European countries experiencing higher life expectancies than the global average. This divergence pattern indicates that industrialized Western Europe achieved mortality reductions earlier than other regions, creating a persistent life expectancy gap.
Ref: https://doi.org/10.1787/9789264214262-10-EN
b.Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?
Germany
Description: The pie chart displays the proportion of years (1825-2025) in which Germany’s life expectancy exceeded the world average. The chart shows “yes” (99%) representing years when Germany had higher life expectancy than the global average, and “no” (1%) representing years when Germany’s life expectancy was equal to or below the world average.
Possible conclusion: Germany maintained life expectancy above the world average for 99% of the 200-year period from 1825 to 2025, with only 1% of years at or below the global average. This consistency demonstrates Germany’s persistent longevity advantage throughout nearly the entire study period.
Ireland
Description:
The pie chart displays the proportion of years (1825-2025) in which Ireland’s life expectancy exceeded the world average. The chart shows two categories: “yes” (97.5%) representing years when Ireland had higher life expectancy than the global average, and “no” (2.5%) representing years when Ireland’s life expectancy was equal to or below the world average.
Possible Conclusion:
Ireland maintained life expectancy above the world average for an overwhelming majority (97.5%) of the 200-year period from 1825 to 2025. Only a small fraction (2.5%) of years showed Ireland at or below the global average, indicating a persistent longevity advantage. This pattern provides compelling evidence that Ireland, despite its historical challenges including the Great Famine, successfully achieved and maintained higher life expectancy.
Belgium
Description: The pie chart displays the proportion of years (1825-2025) in which Belgium’s life expectancy exceeded the world average. The chart shows “yes” (100%) representing years when Belgium had higher life expectancy than the global average, with no red segment visible.
Possible Conclusion: Belgium maintained life expectancy above the world average in every year of the period from 1825 to 2025 and never fell to or below the global average. This pattern is consistent with the sustained benefits of Western European industrialization, comprehensive social welfare systems, and continuous public health improvements, although the chart alone cannot show which of these factors was responsible.
France
Description: The pie chart displays the proportion of years (1825-2025) in which France’s life expectancy exceeded the world average. The chart shows “yes” (99.5%) representing years when France had higher life expectancy than the global average, and “no” (0.5%) representing years when France’s life expectancy was equal to or below the world average.
Possible Conclusion: France maintained life expectancy above the world average for 99.5% of the 200-year period, with only 0.5% of years at or below the global average. This exceptional consistency demonstrates France’s persistent longevity advantage, likely reflecting its early demographic transition, and progressive public health reforms, making it one of the earliest countries to achieve and maintain superior life expectancy compared to the rest of the world.
4 countries grouped
Description: The pie chart aggregates data from all four countries (Germany, Belgium, Ireland, and France) to show the combined proportion of years (1825-2025) in which these Western European nations maintained life expectancy above the world average. The chart displays “yes” (99.5%) representing years when at least one country had higher life expectancy than the global average, and “no” (0.5%) representing years when all four countries were at or below the world average.
Possible Conclusion: Collectively, Germany, Belgium, Ireland, and France maintained life expectancy above the world average for virtually the entire 200-year period (99.5%). Only 0.5% of the combined country-years showed these nations at or below the global average, providing evidence that Western European countries consistently enjoyed a substantial longevity advantage throughout 1825-2025, confirming the hypothesis that these industrialized nations achieved and sustained higher life expectancy than the rest of the world.
c. What are null, alternative hypothesis and significance levels?
Null Hypothesis: The chosen country has a life expectancy equal to the world’s average.
Alternative Hypothesis: The chosen country has a longer life expectancy than the world’s average.
Significance level: 0.05
d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?
Assume that the null hypothesis is true, write a R code to plot the null distribution. Describe the shape of the null distribution.
# 6. Merge Data
germany_comparison_data <- merge(
germany_life,
world_yearly_life,
by = "year",
suffixes = c("_germany", "_world")
)
# 7. Create Outcome Column (Renamed to 'outcome' to match Hypo2)
germany_comparison_data$outcome <- ifelse(
germany_comparison_data$life_expectancy_at_birth_germany > germany_comparison_data$life_expectancy_at_birth_world,
"yes",
"no"
)
→ Sub_Null_hypothesis: The proportion of Yes is 50%.
→ Sub_Alternative_Hypothesis: The proportion of Yes is > 50%.
!!IF:
Sub_Null_hypothesis is true, then Null Hypothesis is true.
Sub_Alternative_Hypothesis is true, then Alternative Hypothesis is true.
* Test statistic: sample proportion of yes outcomes, $P_{yes} = n_{yes} / n_{total}$
where:
P_yes: proportion of yes outcomes
n_yes: number of yes outcomes
n_total: 200 outcomes
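The null distribution for this proportion test can also be simulated directly from the binomial model: under the sub-null hypothesis, each of the 200 yearly outcomes is “yes” with probability 0.5. The sketch below is a base R stand-in for the infer pipeline used in this report, not the report's own code. It uses the observed proportion of 0.990 reported for Germany in this section and an assumed 1000 replicates, with an exact binomial test added as a cross-check.

```r
set.seed(3)

n_total <- 200         # number of yearly outcomes, as stated above
observed_prop <- 0.99  # observed proportion of "yes" reported for Germany

# Under the null, each year is "yes" with probability 0.5.
reps <- 1000
null_props <- rbinom(reps, size = n_total, prob = 0.5) / n_total

# Right-tailed p-value: share of simulated proportions at or above the observed one.
p_value <- mean(null_props >= observed_prop)

# Exact binomial test as a cross-check (198 "yes" outcomes out of 200).
exact <- binom.test(
  x = round(observed_prop * n_total), n = n_total,
  p = 0.5, alternative = "greater"
)

print(p_value)
print(exact$p.value)
```

The simulated proportions cluster tightly around 0.5, so an observed proportion near 1 falls far outside the null distribution.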
# --- INFER / HYPOTHESIS TEST SECTION ---
# Specify the null hypothesis using the correct column name 'outcome'
null_hypothesis_germany <- germany_comparison_data %>%
specify(response = outcome, success = "yes") %>%
hypothesize(null = "point", p = 0.5)
# Generate the null distribution
null_distribution_germany <- null_hypothesis_germany %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
# Visualize with Histogram
ggplot(null_distribution_germany, aes(x = stat)) +
geom_histogram(binwidth = 0.02, fill = "skyblue", color = "black") +
labs(
title = "Null Distribution of Proportion of 'Yes' Outcomes",
x = "Proportion of 'Yes'",
y = "Frequency"
) +
theme_minimal()
# Calculate the observed statistic for Germany
observed_stat_germany <- germany_comparison_data %>%
specify(response = outcome, success = "yes") %>%
calculate(stat = "prop")
observed_stat_germany
## Response: outcome (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.990
null_distribution_germany %>%
visualize() +
shade_p_value(obs_stat = observed_stat_germany$stat, direction = "greater")
# Calculate the p-value
(p_value_germany <- null_distribution_germany %>%
get_p_value(obs_stat = observed_stat_germany, direction = "greater"))
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value_germany$p_value < alpha) {
conclusion <- "Accept the alternative hypothesis: The proportion of Yes outcomes is significantly greater than 50%. It means that Germany has longer life expectancy than the rest of the world"
} else {
conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of 'yes' outcomes is greater than 50%."
}
# Display the conclusion
conclusion
## [1] "Accept the alternative hypothesis: The proportion of Yes outcomes is significantly greater than 50%. It means that Germany has longer life expectancy than the rest of the world"
We then do the same across other countries.
Ireland
## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that Ireland has longer life expectancy than the rest of the world"
France
## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that France has longer life expectancy than the rest of the world"
Four countries grouped
## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that group of four selected countries has longer life expectancy than the rest of the world"
Converting Belgium’s life expectancy:
# Create life expectancy category for Belgium (as per outline start)
belgium_data <- filtered_data[filtered_data$country == "Belgium", ]
belgium_data$life_expectancy_category <- ifelse(as.numeric(belgium_data$life_expectancy_at_birth) >= 60, "Long Life", "Short Life")
DATA GROUPING: Beyond testing each of the four selected countries, we also combine them into a single group (grouped_four_countries) for testing. This requires some extra steps to calculate the group's average life_expectancy and average_daily_income. These steps are:
# 1. Filter and Select
filtered_data_h3 <- world_data %>%
filter(year >= 1825 & year <= 2025) %>%
select(year, country, life_expectancy_at_birth, average_daily_income_per_person, income_level)
selected_countries <- c("Germany", "Belgium", "France", "Ireland")
filtered_selected_countries <- filtered_data_h3 %>%
filter(country %in% selected_countries)
# 2. Calculate Averages
avg_income_per_year <- filtered_selected_countries %>%
group_by(year) %>%
summarise(avg_daily_income = mean(average_daily_income_per_person, na.rm = TRUE))
avg_life_expectancy_by_year <- filtered_selected_countries %>%
group_by(year) %>%
summarise(avg_life_expectancy = mean(as.numeric(life_expectancy_at_birth), na.rm = TRUE))
a) Sorting the four-country average of average_daily_income into levels 1, 2, 3 and 4 based on Gapminder’s sorting.
# Add 'permutated_level' column based on 'avg_daily_income' thresholds
selected_average_income_per_year <- avg_income_per_year %>%
mutate(
permutated_level = case_when(
avg_daily_income <= 2 ~ "level 1",
avg_daily_income <= 8 ~ "level 2",
avg_daily_income <= 32 ~ "level 3",
avg_daily_income < 128 ~ "level 4",
TRUE ~ "level 5"
)
) %>%
left_join(avg_life_expectancy_by_year, by = "year")
b) Sorting Yearly Average Life Expectancy into Long Life and Short Life (≥60 years: Long Life; <60 years: Short Life)
# Create a new categorical variable for life expectancy
selected_average_income_per_year$life_expectancy_category <- ifelse(
selected_average_income_per_year$avg_life_expectancy >= 60,
"Long Life",
"Short Life"
)
# Convert to factor for categorical analysis
selected_average_income_per_year$life_expectancy_category <- factor(
selected_average_income_per_year$life_expectancy_category,
levels = c("Short Life", "Long Life")
)
# Group Income Levels (Exclude Level 5/Others)
income_levels_1_to_4 <- c("level 1", "level 2", "level 3", "level 4")
selected_average_income_per_year$income_level_grouped <- ifelse(
selected_average_income_per_year$permutated_level %in% income_levels_1_to_4,
selected_average_income_per_year$permutated_level,
NA
)
selected_average_income_per_year$income_level_grouped <- factor(
selected_average_income_per_year$income_level_grouped,
levels = c("level 1", "level 2", "level 3", "level 4")
)
a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.
Interesting Hypothesis: There is a relationship between income level and life expectancy in a European country.
Research background: Jetter, Laudage, and Stadelmann (2019) analyze a panel of 197 countries over 213 years and report a strong association between national income and life expectancy. In their study, GDP per capita accounts for more than 64% of the cross-country variation in life expectancy (The Intimate Link between Income Levels and Life Expectancy: Global Evidence from 213 Years; Social Science Quarterly, doi:10.1111/ssqu.12638). In contrast, evidence from the United States suggests that, although the correlation between income and life expectancy is well documented, the mechanisms and patterns of this relationship are still not fully understood (JAMA, 2016, doi:10.1001/jama.2016.4226).
Motivation and research question: These findings raise the question of how generalizable the global income-life expectancy relationship is when we zoom in on specific high-income countries. In particular, the German case may differ from patterns observed in the United States. This study therefore asks: To what extent is income level associated with life expectancy in Germany over the period 1825–2025? If the global relationship is strong and systematic, why might Germany exhibit a different pattern compared with the US?
b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?
Group of four countries
Description: This visualization combines data from Germany, Belgium, France, and Ireland to show the overall relationship between income level and life expectancy in these countries. The chart displays three income levels (Level 2, Level 3, and Level 4), with each bar representing 100% of the years at that income level, split into two life expectancy categories. Income Level 2 is entirely “Short Life”: in all 88 years at this level, the average life expectancy was below 60 years. Income Level 3 is transitional, with about 43% “Short Life” (29 years) and 57% “Long Life” (38 years). Income Level 4 is entirely “Long Life” (46 years).
Possible Outcome: The visual progression provides strong descriptive evidence of a positive association between income level and life expectancy. As the average income level of these four countries rises, the share of years with an average life expectancy of at least 60 years increases dramatically.
Belgium
Description: This chart shows Belgium’s demographic transition across income levels over the 200-year study period. Income Level 2 is completely dominated by “Short Life”, Income Level 3 shows approximately 45-50% “Long Life”, and Income Level 4 is nearly 100% “Long Life”.
France
Description: France’s chart reveals a distinctive pattern that may reflect its unique demographic history. In this chart, Income Level 2 is 100% “Short Life”. Income Level 3 shows approximately 55% “Long Life,” slightly higher than Belgium. Income Level 4 is nearly 100% “Long Life”.
Possible Conclusion: In France, there is a stark gradient in life expectancy across income levels. At the lowest observed income level (Level 2), every year falls in the “Short Life” category (<60 years), whereas at the highest level (Level 4) nearly every year falls in the “Long Life” category (≥60 years).
Ireland
Description: While the other countries have three income levels, Ireland’s chart shows four. At Income Level 1, 100% of years were Short Life (life expectancy <60). At Income Level 2, approximately 85-90% of years were Short Life, with 10-15% Long Life emerging. Income Level 3 shows a dramatic shift to nearly 100% Long Life (life expectancy ≥60), and Income Level 4 is 100% Long Life.
Possible Conclusion: The chart illustrates a clear positive association between income level and life expectancy across Ireland’s 200-year demographic transition.
Germany
Description: Germany’s chart shows a pattern similar to the aggregate but with some distinctive features. At Income Level 2, Germany showed nearly 100% Short Life with a small presence of Long Life (~3-5%). Income Level 3 displayed approximately 40-45% Long Life, indicating a transitional phase. Income Level 4 achieved nearly 100% Long Life, demonstrating Germany’s progression toward high life expectancy at the highest income bracket.
Possible Conclusion: Germany exhibits a gradual income-health gradient with a distinctive early emergence of Long Life even at lower income levels, followed by a moderate transition phase before reaching near-universal longevity at the highest income level.
c. What are null, alternative hypothesis and significance levels?
Null hypothesis: Income level and life expectancy are independent (There is no relationship between income level and life expectancy).
Alternative hypothesis: Income level and life expectancy are dependent (There is a relationship between income level and life expectancy).
Significance Level: 0.05
Test-direction: Right-tailed Test
d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?
Assume that the null hypothesis is true, write a R code to plot the null distribution. Describe the shape of the null distribution.
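* Test statistic: chi-squared statistic, $\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
where:
O_ij: observed values
E_ij: expected values
Group of four countries
!!! To compute the chi-squared test, we first need the observed and expected counts for the group of four countries.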
Observed_values (Oij):
# Only keep income levels 1 to 4, set others as NA so they are excluded
income_levels_1_to_4 <- c("level 1", "level 2", "level 3", "level 4")
selected_average_income_per_year$income_level_grouped <- ifelse(
selected_average_income_per_year$permutated_level %in% income_levels_1_to_4,
selected_average_income_per_year$permutated_level,
NA
)
# Set factor levels for income_level_grouped (exclude "Others")
selected_average_income_per_year$income_level_grouped <- factor(
selected_average_income_per_year$income_level_grouped,
levels = c("level 1", "level 2", "level 3", "level 4")
)
# Ensure life_expectancy_category is a factor with desired levels
selected_average_income_per_year$life_expectancy_category <- factor(
selected_average_income_per_year$life_expectancy_category,
levels = c("Short Life", "Long Life")
)
# Create the table (rows: income levels 1-4 only, no "Others")
table(
selected_average_income_per_year$income_level_grouped,
selected_average_income_per_year$life_expectancy_category
)
##
## Short Life Long Life
## level 1 0 0
## level 2 88 0
## level 3 29 38
## level 4 0 46
Expected_values (Eij):
# Create the contingency table for life expectancy category vs income level group
observed_table_four_countries <- table(
selected_average_income_per_year$life_expectancy_category,
selected_average_income_per_year$income_level_grouped
)
# Define the expected counts function
expectedIndependent <- function(X) {
n = sum(X)
p = rowSums(X) / n
q = colSums(X) / n
return(p %o% q * n) # outer product creates expected table
}
# Calculate expected counts
E_t_four_countries <- expectedIndependent(observed_table_four_countries)
# Display the expected counts table
E_t_four_countries
##            level 1  level 2 level 3  level 4
## Short Life       0 51.22388      39 26.77612
## Long Life        0 36.77612      28 19.22388
Observed Stat:
# Calculate the observed chi-squared statistic for the four-country data
observed_stat_four_countries <- selected_average_income_per_year %>%
specify(life_expectancy_category ~ income_level_grouped) %>%
calculate(stat = "Chisq")
observed_stat_four_countries
## Response: life_expectancy_category (factor)
## Explanatory: income_level_grouped...
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 133.
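As a cross-check, the statistic can be recomputed by hand from the observed counts printed above. This is only a sketch in base R: the counts are typed in from the table, and the empty "level 1" column is left out on the assumption that it should not contribute (its expected counts are 0, which would give 0/0 terms). The names `observed`, `expected`, `chi_sq_by_hand` and `chi_sq_builtin` are illustrative and not part of the original analysis.

```r
# Counts copied from the observed table above; "level 1" is omitted because
# it has no observations (its expected counts would be 0, giving 0/0).
observed <- matrix(
  c(88, 29,  0,
     0, 38, 46),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    life_expectancy = c("Short Life", "Long Life"),
    income_level    = c("level 2", "level 3", "level 4")
  )
)

n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi_sq_by_hand <- sum((observed - expected)^2 / expected)
chi_sq_by_hand

# Cross-check against base R's chisq.test()
chi_sq_builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)
chi_sq_builtin
```

Both values should agree with each other and with the rounded statistic of 133 reported above.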
Four countries
# Null Distribution
null_distribution_four_countries <- selected_average_income_per_year %>%
specify(life_expectancy_category ~ income_level_grouped) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
ggplot(null_distribution_four_countries, aes(x = stat)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Null Distribution of Chi-squared Statistics (Four Countries)") +
theme_minimal()
Shape of Null Distribution: Right-skewed (most permuted statistics cluster near zero, with a long tail to the right)
f. What is the p-value? Note that p-value is the probability of event that the null test statistic is more extreme than the observed value. What is the statistical conclusion for the hypothesis test?
## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in four chosen countries."
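The code that produced the p-value behind this conclusion is not shown. A minimal base-R sketch of a right-tailed permutation p-value is given here, assuming the yearly data match the observed counts printed earlier; the vectors are rebuilt from those counts, and `chi_sq_stat`, `null_stats` and the seed are illustrative choices rather than the original code.

```r
set.seed(2025)  # arbitrary seed, only for reproducibility

# Rebuild one row per year from the observed counts printed earlier
# (level 2: 88 short; level 3: 29 short, 38 long; level 4: 46 long).
income_level <- rep(c("level 2", "level 3", "level 3", "level 4"),
                    times = c(88, 29, 38, 46))
life_category <- rep(c("Short Life", "Short Life", "Long Life", "Long Life"),
                     times = c(88, 29, 38, 46))

# Chi-squared statistic for a two-way table of counts
chi_sq_stat <- function(x, y) {
  obs <- table(x, y)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expd)^2 / expd)
}

observed_stat <- chi_sq_stat(income_level, life_category)

# Null distribution: shuffle the life-expectancy labels 1000 times
null_stats <- replicate(1000, chi_sq_stat(income_level, sample(life_category)))

# Right-tailed p-value: share of permuted statistics >= the observed one
p_value <- mean(null_stats >= observed_stat)
p_value
```

If no permuted statistic reaches the observed value, the estimate is 0, which with 1000 permutations is better reported as p < 0.001 than as an exact zero.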
Belgium
## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in Belgium."
France
## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in France."
Ireland
## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in Ireland."
DATA GROUPING: Beyond testing each of the four selected countries individually, we also test them together as a single group (grouped_four_countries). This requires some extra steps to calculate their average fertility rate and average daily income. These steps are:
1. Filtering the data to the years 1825-2025 and to the four selected countries:
library(dplyr)
filtered_data <- world_data %>%
filter(year >= 1825 & year <= 2025) %>%
select(year, country, babies_per_woman, average_daily_income_per_person, income_level)
selected_countries <- c("Germany", "Belgium", "France", "Ireland")
filtered_selected_countries <- filtered_data %>%
filter(country %in% selected_countries)
filtered_selected_countries
## # A tibble: 804 × 5
## year country babies_per_woman average_daily_income_per_person income_level
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 1825 Belgium 4.74 3.70 level 2
## 2 1826 Belgium 4.74 3.72 level 2
## 3 1827 Belgium 4.73 3.74 level 2
## 4 1828 Belgium 4.73 3.76 level 2
## 5 1829 Belgium 4.72 3.78 level 2
## 6 1830 Belgium 4.72 3.80 level 2
## 7 1831 Belgium 4.77 3.82 level 2
## 8 1832 Belgium 4.82 3.84 level 2
## 9 1833 Belgium 4.87 3.86 level 2
## 10 1834 Belgium 4.92 3.88 level 2
## # ℹ 794 more rows
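2. Calculating the average of babies_per_woman of all four countries in every year, stored as avg_babies_per_woman_by_year.
3. Calculating the average of average_daily_income_per_person of all four countries in every year: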
# Calculate the average of average_daily_income_per_person for each year
avg_income_by_year <- filtered_selected_countries %>%
group_by(year) %>%
summarize(avg_daily_income = mean(average_daily_income_per_person, na.rm = TRUE))
avg_income_by_year
## # A tibble: 201 × 2
## year avg_daily_income
## <dbl> <dbl>
## 1 1825 2.94
## 2 1826 2.98
## 3 1827 2.98
## 4 1828 3.00
## 5 1829 3.04
## 6 1830 3.04
## 7 1831 3.08
## 8 1832 3.17
## 9 1833 3.17
## 10 1834 3.20
## # ℹ 191 more rows
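The join further down relies on avg_babies_per_woman_by_year, whose definition does not appear in this section. A minimal sketch, assuming it was built the same way as avg_income_by_year (a yearly mean across the four countries, ignoring missing values), could look like this. The tibble `toy_countries` is a hypothetical stand-in for filtered_selected_countries, and its numbers are made up for illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for filtered_selected_countries
toy_countries <- tibble(
  year = c(1825, 1825, 1826, 1826),
  country = c("Belgium", "France", "Belgium", "France"),
  babies_per_woman = c(4.74, 4.30, 4.74, NA)
)

# Assumed definition: yearly mean across countries, ignoring missing values
avg_babies_per_woman_by_year <- toy_countries %>%
  group_by(year) %>%
  summarize(avg_babies_per_woman = mean(babies_per_woman, na.rm = TRUE))

avg_babies_per_woman_by_year
```

With na.rm = TRUE, a year in which one country has no value is averaged over the remaining countries rather than dropped.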
4. Sorting the yearly average of average_daily_income_per_person for the four countries into levels 1, 2, 3 and 4 based on Gapminder’s income-level thresholds.
# Add 'permutated_level' column based on 'avg_daily_income' thresholds
selected_average_income_per_year <- avg_income_by_year %>%
mutate(
permutated_level = case_when(
avg_daily_income <= 2 ~ "level 1",
avg_daily_income <= 8 ~ "level 2",
avg_daily_income <= 32 ~ "level 3",
avg_daily_income < 128 ~ "level 4",
TRUE ~ "level 5"
)
)
selected_average_income_per_year
## # A tibble: 201 × 3
## year avg_daily_income permutated_level
## <dbl> <dbl> <chr>
## 1 1825 2.94 level 2
## 2 1826 2.98 level 2
## 3 1827 2.98 level 2
## 4 1828 3.00 level 2
## 5 1829 3.04 level 2
## 6 1830 3.04 level 2
## 7 1831 3.08 level 2
## 8 1832 3.17 level 2
## 9 1833 3.17 level 2
## 10 1834 3.20 level 2
## # ℹ 191 more rows
# Combine avg_babies_per_woman_by_year and selected_average_income_per_year by 'year'
combined_babies_income_table <- avg_babies_per_woman_by_year %>%
left_join(selected_average_income_per_year, by = "year")
combined_babies_income_table
## # A tibble: 201 × 4
## year avg_babies_per_woman avg_daily_income permutated_level
## <dbl> <dbl> <dbl> <chr>
## 1 1825 4.50 2.94 level 2
## 2 1826 4.47 2.98 level 2
## 3 1827 4.44 2.98 level 2
## 4 1828 4.41 3.00 level 2
## 5 1829 4.38 3.04 level 2
## 6 1830 4.36 3.04 level 2
## 7 1831 4.37 3.08 level 2
## 8 1832 4.38 3.17 level 2
## 9 1833 4.40 3.17 level 2
## 10 1834 4.41 3.20 level 2
## # ℹ 191 more rows
# Calculate the average of avg_babies_per_woman for each permutated_level
avg_babies_by_permutated_level <- combined_babies_income_table %>%
group_by(permutated_level) %>%
summarize(avg_babies_per_woman = mean(avg_babies_per_woman, na.rm = TRUE))
avg_babies_by_permutated_level
## # A tibble: 3 × 2
## permutated_level avg_babies_per_woman
## <chr> <dbl>
## 1 level 2 4.02
## 2 level 3 2.59
## 3 level 4 1.73
a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.
Interesting Hypothesis: There is a relationship between income level and fertility rate in a European country.
Theoretical Background: Income level affects fertility decisions in France and Germany; high-earning females in Germany adapt fertility behavior more strongly due to economic incentives and childcare support differences, suggesting that fertility patterns respond to economic conditions. Ref: https://doi.org/10.2139/ssrn.3616728
Highly educated women have higher second birth risks in France and West Germany; stronger effect in France due to better work-family compatibility. This indicates that the income-fertility relationship varies across income groups depending on institutional support and economic resources. Ref: https://doi.org/10.4054/MPIDR-WP-2004-015
b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?
Group of four countries
Germany
At Income Level 2, Germany showed a high median fertility rate of approximately 5.0 babies per woman with a wide distribution (range: ~4.3-5.5). Income Level 3 displayed a median of approximately 2.4 babies per woman with substantial variation (range: ~1.6-5.2). Income Level 4 demonstrated the lowest fertility with a median around 1.5 babies per woman and minimal variation (range: ~1.3-2.0).
Overall, Germany exhibits a clear negative association between income level and fertility, with fertility declining sharply as income increases.
Belgium
At Income Level 2, Belgium showed a median fertility rate of approximately 4.6 babies per woman with moderate variation (range: ~4.2-5.0). Income Level 3 displayed a median of approximately 2.5 babies per woman with wide dispersion (range: ~1.6-4.3). Income Level 4 demonstrated a low median fertility of approximately 1.6 babies per woman with tight clustering (range: ~1.4-1.8).
Overall, Belgium demonstrates a strong negative relationship between income and fertility, following a pattern similar to Germany with progressively declining birth rates at higher income levels.
library(ggplot2)
# Create a box plot of babies per woman by income level for Belgium
ggplot(belgium_data, aes(x = income_level, y = babies_per_woman)) +
geom_boxplot(fill = "pink", color = "darkblue") +
labs(
title = "Babies per Woman in Belgium by Income Level",
x = "Income Level",
y = "Babies per Woman"
) +
theme_minimal()
Ireland
library(ggplot2)
# Create a box plot of babies per woman by income level for Ireland
ggplot(ireland_data, aes(x = income_level, y = babies_per_woman)) +
geom_boxplot(fill = "lightgreen", color = "darkblue") +
labs(
title = "Babies per Woman in Ireland by Income Level",
x = "Income Level",
y = "Babies per Woman"
) +
theme_minimal()
At Income Level 1, Ireland showed a high median fertility rate of approximately 4.2 babies per woman with minimal variation. Income Level 2 displayed a median of approximately 3.2 babies per woman (range: ~2.6-4.3). Income Level 3 showed a median around 3.2 babies per woman with the widest variation (range: ~1.8-4.1). Income Level 4 demonstrated the lowest fertility with a median of approximately 1.9 babies per woman (range: ~1.6-2.1).
Overall, Ireland exhibits a negative income-fertility relationship, though its fertility at Income Levels 3 and 4 remains higher than in Germany, Belgium, and France, reflecting Ireland’s distinct demographic and cultural context.
France
library(ggplot2)
# Create a box plot of babies per woman by income level for France
ggplot(france_data, aes(x = income_level, y = babies_per_woman)) +
geom_boxplot(fill = "Purple", color = "darkblue") +
labs(
title = "Babies per Woman in France by Income Level",
x = "Income Level",
y = "Babies per Woman"
) +
theme_minimal()
At Income Level 2, France displayed a median fertility rate of approximately 3.3 babies per woman with considerable variation (range: ~2.1-5.0, with outliers below 2.0). Income Level 3 showed a median of approximately 2.5 babies per woman (range: ~1.5-3.1). Income Level 4 demonstrated the lowest fertility with a median around 1.8 babies per woman and narrow distribution (range: ~1.7-2.0).
Overall, France shows a clear inverse relationship between income and fertility, with birth rates declining systematically as income increases, though France maintains slightly higher fertility at Level 4 compared to Germany and Belgium.
c. What are null, alternative hypothesis and significance levels?
Null hypothesis: The mean number of babies per woman is the same across all income levels. (There is no relationship between fertility and income level.)
Alternative hypothesis: At least one income level has a different mean number of babies per woman. (There is a relationship between fertility and income level.)
Significance Level: 0.05
Test direction: Right-tailed testing
d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data? Assume that the null hypothesis is true, write R code to plot the null distribution. Describe the shape of the null distribution.
Test statistic: ANOVA → F-statistic
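To make the F-statistic concrete, here is a small base-R sketch that computes it by hand as the ratio of the between-group mean square to the within-group mean square, then checks it against R's one-way ANOVA table. The toy numbers and the names `fertility`, `level`, `f_by_hand` and `f_builtin` are illustrative assumptions and are not taken from the data set.

```r
# Hypothetical toy data: fertility values in three income-level groups
fertility <- c(4.1, 3.9, 4.3, 2.7, 2.5, 2.6, 1.8, 1.7, 1.6)
level <- factor(rep(c("level 2", "level 3", "level 4"), each = 3))

n <- length(fertility)   # total number of observations
k <- nlevels(level)      # number of groups
grand_mean <- mean(fertility)
group_means <- tapply(fertility, level, mean)
group_sizes <- tapply(fertility, level, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((fertility - group_means[as.character(level)])^2)

# F = mean square between / mean square within
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))
f_by_hand

# Cross-check with base R's one-way ANOVA table
f_builtin <- anova(lm(fertility ~ level))[["F value"]][1]
f_builtin
```

A large F means the group means differ by much more than the spread within groups would suggest, which is why only the right tail counts as evidence against the null hypothesis.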
Four countries:
library(infer)
# Specify the null hypothesis: no relationship between permutated_level and mean babies per woman
null_hypothesis_four_countries <- combined_babies_income_table %>%
specify(avg_babies_per_woman ~ permutated_level) %>%
hypothesize(null = "independence")
# Generate the null distribution with 1000 permutations
null_distribution_four_countries <- null_hypothesis_four_countries %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
# Display the first few rows of the null distribution
head(null_distribution_four_countries)
## Response: avg_babies_per_woman (numeric)
## Explanatory: permutated_level (facto...
## # A tibble: 6 × 2
## replicate stat
## <int> <dbl>
## 1 1 2.33
## 2 2 0.973
## 3 3 1.14
## 4 4 1.06
## 5 5 0.0573
## 6 6 0.533
# Calculate the observed statistic for combined_babies_income_table
observed_stat_four_countries <- combined_babies_income_table %>%
specify(avg_babies_per_woman ~ permutated_level) %>%
calculate(stat = "F")
observed_stat_four_countries
## Response: avg_babies_per_woman (numeric)
## Explanatory: permutated_level (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 600.
Shape of null distribution: Right-skewed (most permuted F-statistics lie near zero, with a long tail to the right)
f. What is the p-value? Note that p-value is the probability of event that the null test statistic is more extreme than the observed value. What is the statistical conclusion for the hypothesis test?
null_distribution_four_countries %>%
visualize(fill = "black", color = "black") +
shade_p_value(obs_stat = observed_stat_four_countries$stat, direction = "greater") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value_four_countries$p_value < alpha) {
conclusion <- "Reject the null hypothesis: There is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4 in the group of four selected countries."
} else {
conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that there is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4."
}
# Display the conclusion
conclusion
## [1] "Reject the null hypothesis: There is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4 in the group of four selected countries."
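The conclusion code uses p_value_four_countries$p_value, but the step that creates it is not shown. A plausible sketch, assuming it came from infer's get_p_value() with a right-tailed direction, is given here on toy data. `toy_table`, `observed_stat`, `null_distribution` and `p_value_toy` are hypothetical names, and the numbers are made up for illustration.

```r
library(dplyr)
library(infer)

set.seed(2025)  # arbitrary seed, only for reproducibility

# Hypothetical toy data standing in for combined_babies_income_table
toy_table <- tibble(
  avg_babies_per_woman = c(4.1, 3.9, 4.3, 4.0, 2.7, 2.5, 2.6, 2.8,
                           1.8, 1.7, 1.6, 1.9),
  permutated_level = rep(c("level 2", "level 3", "level 4"), each = 4)
)

# Observed F-statistic
observed_stat <- toy_table %>%
  specify(avg_babies_per_woman ~ permutated_level) %>%
  calculate(stat = "F")

# Permutation null distribution of the F-statistic
null_distribution <- toy_table %>%
  specify(avg_babies_per_woman ~ permutated_level) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# Right-tailed p-value; the result is a one-row tibble with a p_value column
p_value_toy <- null_distribution %>%
  get_p_value(obs_stat = observed_stat, direction = "greater")

p_value_toy$p_value < 0.05
```

When the observed statistic lies far beyond every permuted value, as with the F-statistic of 600 above, get_p_value() returns 0 and warns that this is an approximation limited by the number of replicates.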