Introduction to Statistics: Final Project

I. Motivation

Our group focused on Germany, Belgium, Ireland, and France because these four Western European nations represent contrasting trajectories in how income relates to health and demographic outcomes over 200 years (1825-2025). Germany exemplifies early industrialization with pioneering social welfare yet faced war disruptions. Belgium shows steady industrial development. Ireland uniquely transformed from extreme poverty and the Great Famine to rapid “Celtic Tiger” modernization. France demonstrates early demographic transition with progressive public health reforms.

By comparing these countries, we can test whether the global income-health relationship holds consistently across different European development models, and whether Western Europe’s demographic exceptionalism (low fertility and high life expectancy) varies significantly between nations with different income trajectories. These contrasting development paths allow us to examine how national context shapes the lives of the people living in each country.

II. Description

year: from 1825 to 2025, covering 200 years of development in the selected European countries.
country: Ireland, Belgium, France, and Germany.
babies_per_woman: the average number of children a woman will have in her lifetime.
average_daily_income_per_person: the average daily income per person.
income_level: the income level of a country, from level 1 to level 4.
life_expectancy: the average number of years a person is expected to live.
difference: for each year, the country’s value of a variable minus the world’s average for that year.
permutated_level: an added income-level column that sorts each year into an income level based on the countries’ yearly average income per person (following Gapminder’s sorting).
observed_stat: the value of the test statistic computed from the actual data.
p-value: the probability of getting a result at least as extreme as observed_stat, assuming the null hypothesis is true.

III. Interesting Hypotheses

From 1825 to 2025:

The fertility rate of a European country is smaller than the world’s average.
A European country has a longer life expectancy than the world’s average.
There is a relationship between income level and life expectancy in a European country.
There is a relationship between income level and fertility rate in a European country.

IV. Data Processing

Libraries

readr: Efficiently read CSV datasets.
dplyr: Manipulate and transform data, including filtering, joining, grouping, and summarizing variables.
tidyr: Reshape and organize data for analysis and visualization.
ggplot2: Create detailed and flexible plots.
infer: Perform statistical inference, such as hypothesis testing and confidence interval estimation.

Data Reading and Country Selection

We began by importing the Excel file named world_data.xlsx. We then filtered the data to the years 1825 to 2025 and stored the result as filtered_data.
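A minimal sketch of the reading and filtering step, in base R. To keep it runnable on its own, it writes a tiny CSV to a temporary file and reads that back; the five rows are made-up illustrative values, not the project data, and a spreadsheet file would need an Excel reader in place of read.csv.

```r
# Illustrative stand-in for the project file: made-up values, not real data.
toy <- data.frame(
  year = c(1800, 1825, 1900, 2025, 2050),
  country = c("Germany", "Germany", "France", "Ireland", "Belgium"),
  babies_per_woman = c(5.4, 4.9, 2.8, 1.8, 1.7)
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read the file back and keep only the study period 1825-2025 (inclusive).
world_data <- read.csv(path, stringsAsFactors = FALSE)
filtered_data <- world_data[world_data$year >= 1825 & world_data$year <= 2025, ]

filtered_data
```

Both bounds are inclusive, so the years 1825 and 2025 themselves are kept.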

V. Hypothesis Testing Report

1. Inference for a numerical variable

a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.

Interesting Hypothesis : The fertility rate of a European country is smaller than the world’s average.
Motivation: Most of Europe experienced a substantial reduction in fertility over the past two centuries, with lower fertility rates compared to the rest of the world due to abstinence from marriage. The decline began in France before the end of the eighteenth century, and there are fewer differences between Eastern and Western European trends than expected.

Ref: Coale, A.J., Garaud, M., Szramkiewicz, R., & Burguière, A. (1991). The decline of fertility in Europe from the French Revolution to World War II. Annales. Histoire, Sciences Sociales, 46.

Research papers: By 1985, fertility rates in Europe were below the replacement level of 2.1 births/woman in all but Albania, Ireland, Malta, Poland, and Turkey, following a steady decline from a 1965 postwar peak well above 2.5 in Northern, Western, and Southern Europe and an erratic trend from a lower level in Eastern Europe. Natural decreases (fewer births than deaths) had begun already in Austria, Denmark, Hungary, and the Federal Republic of Germany and can be expected shortly in many other countries. According to current UN medium projections, Europe’s population (minus the USSR) will grow only 6% between 1985 and 2025, from 492 to 524 million and 18.4% of the population in 2025 will be 65 and over. The decline to low fertility in the 1930s during Europe’s 1st demographic transition was propelled by a concern for family and offspring. Behind the 2nd transition is a dramatic shift in norms toward progressiveness and individualism, which is moving Europeans away from marriage and parenthood. Cohabitation and out-of-wedlock fertility are increasingly acceptable; having a child is more and more a deliberate choice made to achieve greater self-fulfillment. Many Europeans view population decline and aging as threats to national influence and the welfare state. However, governments outside Eastern Europe, except for France, have hesitated to try politically risky and costly economic pronatalist incentives. As used in Eastern Europe, coupled with some restrictions on legal abortion, such incentives have not managed to boost fertility back up to replacement level. Immigration as a solution is unfeasible. All countries of immigration have now imposed strict controls, tried to stimulate return migration of guestworkers recruited during labor shortages of the 1960s and early 1970s, and now aim at rapid integration of minorities. 
Only measures compatible with the shift to individualism might slow or reverse the fertility decline, but a rebound to replacement level seems unlikely and long-term population decline appears inevitable for most of Europe.

Ref: van de Kaa, D.J. (1987). Europe’s second demographic transition. Population bulletin, 42 1, 1-59.

b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?

Germany

Descriptions:

1. Green box: Displays babies per woman for Germany over the 200-year period. The values range from just over 1 baby to almost 5 babies, with a median of around 2.5 babies per woman.

2. Blue box: Displays the world’s average babies per woman over the 200-year period. The values range from almost 5 babies to almost 6 babies, with a median of around 5.6 babies per woman.

Possible Conclusions: → Germany has a lower fertility rate than the world’s average.
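The ranges and medians quoted in these box descriptions can be read off a five-number summary. A small sketch, using made-up fertility values as a stand-in for one country's babies_per_woman column:

```r
# Made-up illustrative values, not the project data.
babies_per_woman <- c(4.9, 4.6, 4.1, 3.2, 2.5, 2.1, 1.6, 1.4, 1.5)

# Minimum, lower hinge, median, upper hinge, maximum.
five_numbers <- fivenum(babies_per_woman)
names(five_numbers) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five_numbers

# The numbers a boxplot actually draws: whisker ends, hinges and median.
# Whiskers stop at 1.5 times the box length, so they can differ from the
# minimum and maximum when there are outlying points.
box_numbers <- boxplot.stats(babies_per_woman)$stats
box_numbers
```

In this toy sample there are no outlying points, so both summaries agree.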

Belgium

Descriptions:

1. Pink box: Displays babies per woman for Belgium over the 200-year period. The values range from just over 1.7 babies to almost 4.5 babies, with a median of around 2.7 babies per woman.

2. Blue box: Displays the world’s average babies per woman over the 200-year period. The values range from almost 5 babies to almost 6 babies, with a median of around 5.6 babies per woman.

Possible Conclusions: → Belgium has a lower fertility rate than the world’s average.

France

Descriptions:

1. Yellow box: Displays babies per woman for France over the 200-year period. The values range from 2 babies to almost 3.5 babies, with a median of around 2.8 babies per woman.

Possible Conclusions: → France has a lower fertility rate than the world’s average.

Ireland

Descriptions:

1. Purple box: Displays babies per woman for Ireland over the 200-year period. The values range from just over 2.5 babies to almost 3.6 babies, with a median of around 3.2 babies per woman.

Possible Conclusions: → Ireland has a lower fertility rate than the world’s average.

4 grouped countries

Descriptions:

1. Red box: Displays babies per woman for the group of Ireland, Germany, France and Belgium over the 200-year period. The values range from just over 1.7 babies to 4.5 babies, with a median of around 2.6 babies per woman.

Possible Conclusions: → The group of four selected countries has a lower fertility rate than the world’s average.

c. What are null, alternative hypothesis and significance levels?

Null Hypothesis: The chosen country has a fertility rate equal to the world’s average rate.
Alternative Hypothesis: The chosen country has a lower fertility rate than the world’s average rate.
Significance level: 0.05

d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?

What is the p-value? Note that the p-value is the probability of the event that the null test statistic is more extreme than the observed value. What is the statistical conclusion for the hypothesis test?

Testing Theory: We compare the mean of the annual differences between the chosen country’s babies per woman and the world’s average to 0. If the country’s mean annual difference is less than 0, the country has a lower fertility rate than the rest of the world, and vice versa.

# Merge Germany's babies-per-woman data with the world annual average by year
germany_vs_world <- merge(
  germany_babies_by_year,
  world_year_avg_babies,
  by = "year"
)

# Calculate the difference
germany_vs_world$difference <- germany_vs_world$babies_per_woman - germany_vs_world$world_year_avg_babies

# Variable for the differences
germany_vs_world_differences <- germany_vs_world$difference

# Select only year and difference columns
germany_difference_table <- germany_vs_world[, c("year", "difference")]

# Display the resulting table
germany_difference_table

##     year difference
## 1   1825 -0.7422335
## 2   1826 -0.8275127
## 3   1827 -0.9209137
## 4   1828 -1.0229442
## 5   1829 -1.1105076
## 6   1830 -1.2054822
## 7   1831 -1.1802538
## 8   1832 -1.1485279
## 9   1833 -1.1320812
## 10  1834 -1.1068020
## 11  1835 -1.0800000
## 12  1836 -1.0757360
## 13  1837 -1.0739594
## 14  1838 -1.0712183
## 15  1839 -1.0713706
## 16  1840 -1.0742640
## 17  1841 -1.0515228
## 18  1842 -1.0249746
## 19  1843 -1.0021320
## 20  1844 -0.9707614
## 21  1845 -0.9474112
## 22  1846 -0.9452284
## 23  1847 -0.9414721
## 24  1848 -0.9520305
## 25  1849 -0.9595431
## 26  1850 -0.9574619
## 27  1851 -1.0943655
## 28  1852 -1.2131472
## 29  1853 -1.3460914
## 30  1854 -1.4737056
## 31  1855 -1.6022843
## 32  1856 -1.4843147
## 33  1857 -1.3750254
## 34  1858 -1.2673096
## 35  1859 -1.1526904
## 36  1860 -1.0404061
## 37  1861 -1.0055330
## 38  1862 -0.9770051
## 39  1863 -0.9386294
## 40  1864 -0.9139086
## 41  1865 -0.8727919
## 42  1866 -0.8382234
## 43  1867 -0.8072589
## 44  1868 -0.7808122
## 45  1869 -0.7574112
## 46  1870 -0.7417259
## 47  1871 -0.6747208
## 48  1872 -0.6120812
## 49  1873 -0.5676650
## 50  1874 -0.5106599
## 51  1875 -0.4606091
## 52  1876 -0.5414213
## 53  1877 -0.6068020
## 54  1878 -0.6692893
## 55  1879 -0.7567005
## 56  1880 -0.8275635
## 57  1881 -0.7498985
## 58  1882 -0.6773604
## 59  1883 -0.6004569
## 60  1884 -0.6115736
## 61  1885 -0.6118782
## 62  1886 -0.6186294
## 63  1887 -0.6453807
## 64  1888 -0.6497970
## 65  1889 -0.6457868
## 66  1890 -0.6374619
## 67  1891 -0.6703046
## 68  1892 -0.6609645
## 69  1893 -0.6771574
## 70  1894 -0.6984264
## 71  1895 -0.7317766
## 72  1896 -0.7434518
## 73  1897 -0.7693909
## 74  1898 -0.7755838
## 75  1899 -0.8220812
## 76  1900 -0.8887310
## 77  1901 -0.9373096
## 78  1902 -0.9912690
## 79  1903 -1.0283756
## 80  1904 -1.1041624
## 81  1905 -1.1657360
## 82  1906 -1.2550761
## 83  1907 -1.3294416
## 84  1908 -1.4185787
## 85  1909 -1.5631472
## 86  1910 -1.7219797
## 87  1911 -1.8783756
## 88  1912 -2.0298985
## 89  1913 -2.1706599
## 90  1914 -2.4004569
## 91  1915 -2.5868528
## 92  1916 -2.8273096
## 93  1917 -3.0562437
## 94  1918 -3.2979695
## 95  1919 -3.2057868
## 96  1920 -3.2087817
## 97  1921 -3.1195431
## 98  1922 -3.0416244
## 99  1923 -2.9620812
## 100 1924 -3.1463959
## 101 1925 -3.3467513
## 102 1926 -3.4472589
## 103 1927 -3.5407614
## 104 1928 -3.5180711
## 105 1929 -3.5489848
## 106 1930 -3.5903046
## 107 1931 -3.7269543
## 108 1932 -3.7825888
## 109 1933 -3.7899492
## 110 1934 -3.4272081
## 111 1935 -3.3015228
## 112 1936 -3.2528426
## 113 1937 -3.2254315
## 114 1938 -3.0604569
## 115 1939 -2.9008629
## 116 1940 -2.8847716
## 117 1941 -3.0267513
## 118 1942 -3.4261421
## 119 1943 -3.2600000
## 120 1944 -3.3682234
## 121 1945 -3.7155838
## 122 1946 -3.2996447
## 123 1947 -3.2869543
## 124 1948 -3.2112690
## 125 1949 -3.1627411
## 126 1950 -3.2363959
## 127 1951 -3.2410152
## 128 1952 -3.2737056
## 129 1953 -3.2730457
## 130 1954 -3.2595939
## 131 1955 -3.2506091
## 132 1956 -3.2187817
## 133 1957 -3.1845178
## 134 1958 -3.1782234
## 135 1959 -3.0989340
## 136 1960 -3.0731980
## 137 1961 -2.9929949
## 138 1962 -2.9993401
## 139 1963 -2.9385787
## 140 1964 -2.9086294
## 141 1965 -2.8960914
## 142 1966 -2.8427919
## 143 1967 -2.8625381
## 144 1968 -2.9008629
## 145 1969 -2.9850254
## 146 1970 -3.1101015
## 147 1971 -3.1343147
## 148 1972 -3.2909645
## 149 1973 -3.3922843
## 150 1974 -3.3595431
## 151 1975 -3.3258883
## 152 1976 -3.2504569
## 153 1977 -3.1700508
## 154 1978 -3.1226904
## 155 1979 -3.0821320
## 156 1980 -2.9677157
## 157 1981 -2.9477157
## 158 1982 -2.9318782
## 159 1983 -2.9496447
## 160 1984 -2.9330457
## 161 1985 -2.8850761
## 162 1986 -2.7875127
## 163 1987 -2.7077665
## 164 1988 -2.6294924
## 165 1989 -2.5928426
## 166 1990 -2.5046193
## 167 1991 -2.5508629
## 168 1992 -2.5078680
## 169 1993 -2.4450761
## 170 1994 -2.4069543
## 171 1995 -2.3185279
## 172 1996 -2.1725888
## 173 1997 -2.0545685
## 174 1998 -2.0071066
## 175 1999 -1.9438071
## 176 2000 -1.8759391
## 177 2001 -1.8572589
## 178 2002 -1.8156345
## 179 2003 -1.7748731
## 180 2004 -1.7226904
## 181 2005 -1.7106091
## 182 2006 -1.6882741
## 183 2007 -1.6321320
## 184 2008 -1.6144670
## 185 2009 -1.6115736
## 186 2010 -1.5492893
## 187 2011 -1.5271066
## 188 2012 -1.4826396
## 189 2013 -1.4337563
## 190 2014 -1.3577157
## 191 2015 -1.2917766
## 192 2016 -1.1621827
## 193 2017 -1.1295939
## 194 2018 -1.0835533
## 195 2019 -1.0661929
## 196 2020 -1.0298985
## 197 2021 -0.9469036
## 198 2022 -1.0198985
## 199 2023 -1.0062437
## 200 2024 -0.9728934
## 201 2025 -0.9348223

→ Sub_Null_hypothesis: The mean of the annual differences between the chosen country and the world’s average is equal to 0.

→ Sub_Alternative_Hypothesis: The mean of the annual differences between the chosen country and the world’s average is less than 0.

!!IF:

Sub_Null_hypothesis is true, then Null Hypothesis is true.

Sub_Alternative_Hypothesis is true, then Alternative Hypothesis is true.

Test statistic: Sample Mean Testing (applied to the annual differences)

X bar = (x_1 + x_2 + ... + x_n) / n

where:

x: annual difference between the chosen country’s fertility rate and the world’s average

n: number of years (the difference table above has 201 rows, 1825 to 2025 inclusive)

X bar: mean of the annual differences

Germany
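The steps that follow plot and use a null_distribution object whose construction is not shown in this report. One way it might be built is a bootstrap under the null: shift the annual differences so that their mean is 0, resample them with replacement, and record the mean of each resample. The sketch below does this in base R on made-up differences. With the infer package, the corresponding pipeline would be specify(), hypothesize(null = "point", mu = 0), generate(type = "bootstrap") and calculate(stat = "mean"). The 1000 replicates and the seed are assumptions.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible

# Made-up annual differences (country minus world); not the project data.
differences <- c(-0.7, -1.1, -0.9, -2.4, -3.3, -3.1, -2.9, -1.8, -1.2, -0.9)

# Test statistic: the sample mean of the differences.
observed_mean <- mean(differences)

# Impose the null hypothesis (true mean = 0) by recentring the sample.
centred <- differences - observed_mean

# Bootstrap: resample with replacement and store each resample's mean.
reps <- 1000
null_means <- replicate(reps, mean(sample(centred, replace = TRUE)))

# One row per replicate, with the statistic in a column named `stat`.
null_distribution <- data.frame(replicate = seq_len(reps), stat = null_means)

observed_mean
summary(null_distribution$stat)
```

Because the sample is recentred before resampling, the simulated means cluster around 0, which is what the null hypothesis claims.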

Null distribution

# Plot a histogram of the null distribution of mean differences
library(ggplot2)

ggplot(null_distribution, aes(x = stat)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
  labs(
    title = "Null Distribution of Mean of Differences (Germany vs World)",
    x = "Mean Difference",
    y = "Frequency"
  ) +
  theme_minimal()

Observed Stat

# Calculate the observed statistic (mean difference) from germany_difference_table
observed_stat <- germany_difference_table %>%
  specify(response = difference) %>%
  calculate(stat = "mean")

observed_stat

## Response: difference (numeric)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 -1.92

Null distribution with observed stat

# Visualize the null distribution and shade the p-value region for the observed statistic (direction = "less")
null_distribution %>%
  visualize() +
  shade_p_value(obs_stat = observed_stat, direction = "less")

p-value

# Calculate the p-value for the observed statistic (direction = "less")
p_value <- null_distribution %>%
  get_p_value(obs_stat = observed_stat, direction = "less")

p_value

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
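A simulated p-value is the share of null statistics at least as extreme as the observed one. A reported 0 therefore means that none of the replicates fell at or below the observed mean, so the p-value is smaller than 1/reps rather than exactly zero. A small sketch with a made-up null distribution (the normal draws, the spread of 0.1 and the 1000 replicates are assumptions):

```r
set.seed(2)  # assumed seed, only to make the sketch reproducible

# Stand-in null distribution centred at 0; made-up, not the project data.
reps <- 1000
null_stats <- rnorm(reps, mean = 0, sd = 0.1)
observed_stat <- -1.92  # observed mean difference reported for Germany

# Left-tailed p-value: proportion of null statistics <= observed statistic.
p_value <- mean(null_stats <= observed_stat)
p_value

# With no replicate that extreme, report the resolution limit 1 / reps
# as an upper bound instead of an exact 0.
p_value_bound <- max(p_value, 1 / reps)
p_value_bound
```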

Significance level and conclusion

# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value$p_value < alpha) {
  conclusion <- "Accept the alternative hypothesis: The mean of annual differences in babies per woman between Germany and the world is less than 0. It means that Germany has lower fertility rates than the rest of world"
} else {
  conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the mean of annual differences in babies per woman between Germany and the world is less than 0."
}

# Display the conclusion
conclusion

## [1] "Accept the alternative hypothesis: The mean of annual differences in babies per woman between Germany and the world is less than 0. It means that Germany has lower fertility rates than the rest of world"

We then do the same across other countries.

Belgium

## [1] "Accept the alternative hypothesis: Belgium has lower fertility rates than the rest of world."

France

## [1] "Accept the alternative hypothesis: France has lower fertility rates than the rest of world."

Ireland

## [1] "Accept the alternative hypothesis: Ireland has lower fertility rates than the rest of world."

GROUP OF FOUR COUNTRIES:

## [1] "Accept the alternative hypothesis: The group of four countries has lower fertility rates than the rest of world."

2. Inference for a categorical variable

a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.

Interesting Hypothesis: A European country has a longer life expectancy than the world’s average.
Research Papers: Wilmoth (2011) notes that by 1900, industrialized countries had life expectancies of 40-50 years, rising to around 80 years in the healthiest countries by 2000, demonstrating a dramatic improvement in longevity concentrated in developed nations.

Ref: Wilmoth, J.R. (2011). Increase of Human Longevity: Past, Present, and Future.

Zijdeman et al. (2014) suggest that life expectancy initially diverged in the late 19th and early 20th centuries, with Western European countries experiencing higher life expectancies than the global average. This divergence pattern indicates that industrialized Western Europe achieved mortality reductions earlier than other regions, creating a persistent life expectancy gap.

Ref: https://doi.org/10.1787/9789264214262-10-EN

Motivation: Given this historical evidence of a Western European longevity advantage, we hypothesize that Germany, Belgium, Ireland, and France consistently maintained higher life expectancy than the world average throughout the 200-year period (1825-2025), reflecting their earlier access to public health improvements, medical advances, and economic development.

b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?

Germany

Description: The pie chart displays the proportion of years (1825-2025) in which Germany’s life expectancy exceeded the world average. The chart shows “yes” (99%) representing years when Germany had higher life expectancy than the global average, and “no” (1%) representing years when Germany’s life expectancy was equal to or below the world average.

Possible conclusion: Germany maintained life expectancy above the world average for 99% of the 200-year period from 1825 to 2025, with only 1% of years at or below the global average. This consistency demonstrates Germany’s persistent longevity advantage throughout nearly the entire study period.

Ireland

Description:

The pie chart displays the proportion of years (1825-2025) in which Ireland’s life expectancy exceeded the world average. The chart shows two categories: “yes” (97.5%) representing years when Ireland had higher life expectancy than the global average, and “no” (2.5%) representing years when Ireland’s life expectancy was equal to or below the world average.

Possible Conclusion:

Ireland maintained life expectancy above the world average for an overwhelming majority (97.5%) of the 200-year period from 1825 to 2025. Only a small fraction (2.5%) of years showed Ireland at or below the global average, indicating a persistent longevity advantage. This pattern provides compelling evidence that Ireland, despite its historical challenges including the Great Famine, successfully achieved and maintained higher life expectancy.

Belgium

Description: The pie chart displays the proportion of years (1825-2025) in which Belgium’s life expectancy exceeded the world average. The chart shows “yes” (100%) representing years when Belgium had higher life expectancy than the global average, with no red segment visible.

Possible Conclusion: Belgium maintained life expectancy above the world average for 100% of the 200-year period from 1825 to 2025, demonstrating perfect consistency in its longevity advantage. This pattern indicates that Belgium never fell to or below the global average throughout the entire study period, reflecting the sustained benefits of Western European industrialization, comprehensive social welfare systems, and continuous public health improvements that placed Belgium consistently ahead of global mortality trends.

France

Description: The pie chart displays the proportion of years (1825-2025) in which France’s life expectancy exceeded the world average. The chart shows “yes” (99.5%) representing years when France had higher life expectancy than the global average, and “no” (0.5%) representing years when France’s life expectancy was equal to or below the world average.

Possible Conclusion: France maintained life expectancy above the world average for 99.5% of the 200-year period, with only 0.5% of years at or below the global average. This exceptional consistency demonstrates France’s persistent longevity advantage, likely reflecting its early demographic transition, and progressive public health reforms, making it one of the earliest countries to achieve and maintain superior life expectancy compared to the rest of the world.

4 countries grouped

Description: The pie chart aggregates data from all four countries (Germany, Belgium, Ireland, and France) to show the proportion of years (1825-2025) in which these Western European nations, taken as a group, maintained life expectancy above the world average. The chart displays “yes” (99.5%) representing years when the group had higher life expectancy than the global average, and “no” (0.5%) representing years when the group was at or below the world average.

Possible Conclusion: Collectively, Germany, Belgium, Ireland, and France maintained life expectancy above the world average for virtually the entire 200-year period (99.5%). Only 0.5% of the years showed the group at or below the global average, providing evidence that Western European countries consistently enjoyed a substantial longevity advantage throughout 1825-2025 and supporting the hypothesis that these industrialized nations achieved and sustained higher life expectancy than the rest of the world.

c. What are null, alternative hypothesis and significance levels?

Null Hypothesis: The chosen country has a life expectancy equal to the world’s average.
Alternative Hypothesis: The chosen country has a longer life expectancy than the world’s average.
Significance level: 0.05

d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?

Assuming that the null hypothesis is true, write R code to plot the null distribution. Describe the shape of the null distribution.

Testing Theory: For each year, the difference between the chosen country’s life expectancy and the world’s average is coded as Yes (difference > 0) or No (difference <= 0), and the proportion of Yes is compared to 50%. If the country’s proportion of Yes is greater than 50%, the country has a longer life expectancy than the rest of the world, and vice versa.

# 6. Merge Data
germany_comparison_data <- merge(
  germany_life,
  world_yearly_life,
  by = "year",
  suffixes = c("_germany", "_world")
)
# 7. Create Outcome Column (Renamed to 'outcome' to match Hypo2)
germany_comparison_data$outcome <- ifelse(
  germany_comparison_data$life_expectancy_at_birth_germany > germany_comparison_data$life_expectancy_at_birth_world,
  "yes",
  "no"
)

→ Sub_Null_hypothesis: The proportion of Yes is 50%.

→ Sub_Alternative_Hypothesis: The proportion of Yes is > 50%.

!!IF:

Sub_Null_hypothesis is true, then Null Hypothesis is true.

Sub_Alternative_Hypothesis is true, then Alternative Hypothesis is true.

Test statistic: Sample Proportion Testing

P_yes = n_yes / n_total

where:

P_yes: proportion of yes outcomes

n_yes: number of yes outcomes

n_total: total number of outcomes (200)
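As a worked illustration of this statistic, the sketch below computes the sample proportion from a vector of yes/no outcomes and, as a cross-check on the simulation approach, runs an exact one-sided binomial test against 50%. The outcome counts (198 yes, 2 no) are made-up values chosen to be close to the 99% reported for Germany, and binom.test is a swapped-in alternative, not the method used in this report.

```r
# Made-up outcome vector: 198 "yes" years and 2 "no" years.
outcome <- c(rep("yes", 198), rep("no", 2))

n_yes <- sum(outcome == "yes")
n_total <- length(outcome)

# Sample proportion of "yes" outcomes.
p_yes <- n_yes / n_total
p_yes

# Exact binomial test of H0: p = 0.5 against H1: p > 0.5.
# This treats the years as independent draws, which is an assumption.
exact_test <- binom.test(n_yes, n_total, p = 0.5, alternative = "greater")
exact_test$p.value
```

Consecutive years are unlikely to be fully independent, so the exact p-value is best read as a rough sanity check rather than a precise figure.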

Null distribution

# --- INFER / HYPOTHESIS TEST SECTION ---

# Specify the null hypothesis using the correct column name 'outcome'
null_hypothesis_germany <- germany_comparison_data %>%
  specify(response = outcome, success = "yes") %>%
  hypothesize(null = "point", p = 0.5)

# Generate the null distribution
null_distribution_germany <- null_hypothesis_germany %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

# Visualize with Histogram
ggplot(null_distribution_germany, aes(x = stat)) +
  geom_histogram(binwidth = 0.02, fill = "skyblue", color = "black") +
  labs(
    title = "Null Distribution of Proportion of 'Yes' Outcomes",
    x = "Proportion of 'Yes'",
    y = "Frequency"
  ) +
  theme_minimal()

Observed Stat

# Calculate the observed statistic for Germany
observed_stat_germany <- germany_comparison_data %>%
  specify(response = outcome, success = "yes") %>%
  calculate(stat = "prop")

observed_stat_germany

## Response: outcome (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 0.990

Null distribution with Observed Stat

null_distribution_germany %>%
  visualize() +
  shade_p_value(obs_stat = observed_stat_germany$stat, direction = "greater")

p-value

# Calculate the p-value
(p_value_germany<- null_distribution_germany %>%
  get_p_value(obs_stat = observed_stat_germany, direction =  "greater"))

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Significance level and conclusion

# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value_germany$p_value < alpha) {
  conclusion <- "Accept the alternative hypothesis: The proportion of Yes outcomes is significantly greater than 50%. It means that Germany has longer life expectancy than the rest of the world"
} else {
  conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of 'yes' outcomes is greater than 50%."
}

# Display the conclusion
conclusion

## [1] "Accept the alternative hypothesis: The proportion of Yes outcomes is significantly greater than 50%. It means that Germany has longer life expectancy than the rest of the world"

We then do the same across other countries.

Ireland

## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that Ireland has longer life expectancy than the rest of the world"

France

## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that France has longer life expectancy than the rest of the world"

Four countries grouped

## [1] "Accept the alternative hypothesis: the proportion of Yes outcomes is significantly greater than 50%. It means that group of four selected countries has longer life expectancy than the rest of the world"

3. Hypothesis test for dependence between two categorical variables

DATA CONVERTING: We convert life_expectancy, a numerical variable, into a categorical variable with two levels, “Long Life” and “Short Life”, for each of the four countries and for the four countries as a group.

Converting Belgium’s life expectancy:

# Create life expectancy category for Belgium (as per outline start)
belgium_data <- filtered_data[filtered_data$country == "Belgium", ]
belgium_data$life_expectancy_category <- ifelse(as.numeric(belgium_data$life_expectancy_at_birth) >= 60, "Long Life", "Short Life")

DATA GROUPING: Beyond testing each of the four selected countries, we also test them as a group (grouped_four_countries). This requires some extra steps to calculate their average life_expectancy and average_daily_income. These steps are:

Filter the data to the years 1825 to 2025 and select the columns year, country, life_expectancy_at_birth, average_daily_income_per_person, and income_level, storing the result as filtered_data_h3.

# 1. Filter and Select
filtered_data_h3 <- world_data %>%
  filter(year >= 1825 & year <= 2025) %>%
  select(year, country, life_expectancy_at_birth, average_daily_income_per_person, income_level)

selected_countries <- c("Germany", "Belgium", "France", "Ireland")
filtered_selected_countries <- filtered_data_h3 %>%
  filter(country %in% selected_countries)

# 2. Calculate Averages
avg_income_per_year <- filtered_selected_countries %>%
  group_by(year) %>%
  summarise(avg_daily_income = mean(average_daily_income_per_person, na.rm = TRUE))

avg_life_expectancy_by_year <- filtered_selected_countries %>%
  group_by(year) %>%
  summarise(avg_life_expectancy = mean(as.numeric(life_expectancy_at_birth), na.rm = TRUE))

a) Sorting the average of average_daily_income of four countries into levels 1,2,3 and 4 based on Gapminder’s sorting.

# Add 'permutated_level' column based on 'avg_daily_income' thresholds
selected_average_income_per_year <- avg_income_per_year %>%
  mutate(
    permutated_level = case_when(
      avg_daily_income <= 2 ~ "level 1",
      avg_daily_income <= 8 ~ "level 2",
      avg_daily_income <= 32 ~ "level 3",
      avg_daily_income < 128 ~ "level 4",
      TRUE ~ "level 5"
    )
  ) %>%
  left_join(avg_life_expectancy_by_year, by = "year")
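To make the cut-offs concrete, the sketch below applies the same thresholds to a few made-up income values using a small base-R helper. classify_income is a hypothetical name introduced only for this illustration; it is not a function from the report or from dplyr. Values exactly on a boundary (2, 8, 32) fall into the lower level, while 128 falls into level 5 because the last comparison is strict.

```r
# Hypothetical helper mirroring the case_when() thresholds above.
classify_income <- function(avg_daily_income) {
  ifelse(avg_daily_income <= 2, "level 1",
  ifelse(avg_daily_income <= 8, "level 2",
  ifelse(avg_daily_income <= 32, "level 3",
  ifelse(avg_daily_income < 128, "level 4", "level 5"))))
}

# Made-up incomes chosen to sit on and around the boundaries.
incomes <- c(1.5, 2, 8, 8.01, 32, 127.9, 128)
levels_assigned <- classify_income(incomes)
data.frame(income = incomes, level = levels_assigned)
```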

b) Sorting Yearly Average Life Expectancy into Long Life and Short Life (>= 60 years: Long Life; < 60 years: Short Life)

# Create a new categorical variable for life expectancy
selected_average_income_per_year$life_expectancy_category <- ifelse(
  selected_average_income_per_year$avg_life_expectancy >= 60,
  "Long Life",
  "Short Life"
)

# Convert to factor for categorical analysis
selected_average_income_per_year$life_expectancy_category <- factor(
  selected_average_income_per_year$life_expectancy_category,
  levels = c("Short Life", "Long Life")
)

# Group Income Levels (Exclude Level 5/Others)
income_levels_1_to_4 <- c("level 1", "level 2", "level 3", "level 4")
selected_average_income_per_year$income_level_grouped <- ifelse(
  selected_average_income_per_year$permutated_level %in% income_levels_1_to_4,
  selected_average_income_per_year$permutated_level,
  NA
)

selected_average_income_per_year$income_level_grouped <- factor(
  selected_average_income_per_year$income_level_grouped,
  levels = c("level 1", "level 2", "level 3", "level 4")
)

a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.

Interesting Hypothesis: There is a relationship between income level and life expectancy in a European country.

Research background: Jetter, Laudage, and Stadelmann (2019) analyze a panel of 197 countries over 213 years and report a strong association between national income and life expectancy. In their study, GDP per capita accounts for more than 64% of the cross-country variation in life expectancy (The Intimate Link between Income Levels and Life Expectancy: Global Evidence from 213 Years; Social Science Quarterly, doi:10.1111/ssqu.12638). In contrast, evidence from the United States suggests that, although the correlation between income and life expectancy is well documented, the mechanisms and patterns of this relationship are still not fully understood (JAMA, 2016, doi:10.1001/jama.2016.4226).
Motivation and research question: These findings raise the question of how generalizable the global income-life expectancy relationship is when we zoom in on specific high-income countries. In particular, the German case may differ from patterns observed in the United States. This study therefore asks: To what extent is income level associated with life expectancy in Germany over the period 1825–2025? If the global relationship is strong and systematic, why might Germany exhibit a different pattern compared with the US?

b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?

Group of four countries

Description: This visualization combines data from Germany, Belgium, France, and Ireland to show the overall relationship between income levels and life expectancy in these countries. The chart displays three income levels (Level 2, Level 3, and Level 4), with each bar representing 100% of the yearly observations at that level, segmented into two life expectancy categories. Income Level 2 is dominated entirely by “Short Life”: in nearly 100% of those years, life expectancy was below 60 years. Income Level 3 shows a transitional pattern with approximately 40-45% “Short Life” and 55-60% “Long Life”, and Income Level 4 falls almost entirely in the “Long Life” category. The minimal “Short Life” presence there indicates that very few Level 4 years had a life expectancy below 60.
Possible Conclusion: The visual progression provides strong descriptive evidence of a positive association between income level and life expectancy. As income levels rise across these four countries, the share of years with a life expectancy of 60 or more increases dramatically.

Belgium

Description: This chart shows Belgium’s demographic transition across income levels over the 200-year study period. Income Level 2 is completely dominated by “Short Life”, while Income Level 3 shows approximately 45-50% “Long Life”, and Income Level 4 is nearly 100% “Long Life”.

Possible Conclusion: It shows a clear three-stage progression consistent with classic demographic transition theory, though the Level 3 transition appears slightly more gradual than some other countries in the study.

France

Description: France’s chart reveals a distinctive pattern that may reflect its unique demographic history. In this chart, Income Level 2 is 100% “Short Life”. Income Level 3 shows approximately 55% “Long Life,” slightly higher than Belgium. Income Level 4 is nearly 100% “Long Life”.
Possible Conclusion: In France, a stark socioeconomic gradient in life expectancy exists, with low-income populations (Level 2) experiencing nearly universal short life (<60 years) compared to high-income groups (Level 4) where longevity (≥60 years) is virtually guaranteed.

Ireland

Description: While the other countries have three income levels, Ireland’s chart shows a notably different pattern, with four income levels represented. At Income Level 1, the population experienced 100% Short Life (age <60). Income Level 2 showed approximately 85-90% Short Life with 10-15% Long Life emerging. Income Level 3 demonstrated a dramatic shift with nearly 100% Long Life (age ≥60), and Income Level 4 maintained 100% Long Life.
Possible Conclusion: It illustrates a clear positive association between income level and life expectancy across Ireland’s 200-year demographic transition.

Germany

Description: Germany’s chart shows a pattern similar to the aggregate but with some distinctive features. At Income Level 2, Germany showed nearly 100% Short Life with a small presence of Long Life (~3-5%). Income Level 3 displayed approximately 40-45% Long Life, indicating a transitional phase. Income Level 4 achieved nearly 100% Long Life, demonstrating Germany’s progression toward high life expectancy at the highest income bracket.
Possible Conclusion: Germany exhibits a gradual income-health gradient with a distinctive early emergence of Long Life even at lower income levels, followed by a moderate transition phase before reaching near-universal longevity at the highest income level.

c. What are null, alternative hypothesis and significance levels?

Null hypothesis: Income level and life expectancy are independent (there is no relationship between income level and life expectancy).

Alternative hypothesis: Income level and life expectancy are dependent (There is a relationship between income level and life expectancy).
Significance Level: 0.05
Test-direction: Right-tailed Test

d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?

Assume that the null hypothesis is true, write a R code to plot the null distribution. Describe the shape of the null distribution.

Test statistic: Chi-square statistic

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ are the observed values and $E_{ij}$ are the expected values.

Group of four countries

To compute the chi-square statistic, we first need the observed and expected counts for the group of four countries.

Observed_values (Oij):

# Only keep income levels 1 to 4, set others as NA so they are excluded
income_levels_1_to_4 <- c("level 1", "level 2", "level 3", "level 4")
selected_average_income_per_year$income_level_grouped <- ifelse(
  selected_average_income_per_year$permutated_level %in% income_levels_1_to_4,
  selected_average_income_per_year$permutated_level,
  NA
)

# Set factor levels for income_level_grouped (exclude "Others")
selected_average_income_per_year$income_level_grouped <- factor(
  selected_average_income_per_year$income_level_grouped,
  levels = c("level 1", "level 2", "level 3", "level 4")
)

# Ensure life_expectancy_category is a factor with desired levels
selected_average_income_per_year$life_expectancy_category <- factor(
  selected_average_income_per_year$life_expectancy_category,
  levels = c("Short Life", "Long Life")
)

# Create the table (rows: income levels 1-4 only, no "Others")
table(
  selected_average_income_per_year$income_level_grouped,
  selected_average_income_per_year$life_expectancy_category
)

##          
##           Short Life Long Life
##   level 1          0         0
##   level 2         88         0
##   level 3         29        38
##   level 4          0        46

Expected_values (Eij):

# Create the contingency table for life expectancy category vs income level group
observed_table_four_countries <- table(
  selected_average_income_per_year$life_expectancy_category,
  selected_average_income_per_year$income_level_grouped
)

# Define the expected counts function
expectedIndependent <- function(X) {
  n = sum(X)
  p = rowSums(X) / n
  q = colSums(X) / n
  return(p %o% q * n) # outer product creates expected table
}

# Calculate expected counts
E_t_four_countries <- expectedIndependent(observed_table_four_countries)

# Display the expected counts table
E_t_four_countries

##            level 1  level 2 level 3  level 4
## Short Life       0 51.22388      39 26.77612
## Long Life        0 36.77612      28 19.22388

Observed Stat:

# Calculate the observed chi-squared statistic for the four-country data
observed_stat_four_countries <- selected_average_income_per_year %>%
  specify(life_expectancy_category ~ income_level_grouped) %>%
  calculate(stat = "Chisq")

observed_stat_four_countries

## Response: life_expectancy_category (factor)
## Explanatory: income_level_grouped...
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  133.
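As a cross-check, the observed statistic can be recomputed by hand from the counts printed above. The sketch below is not part of the original analysis. It re-enters the observed counts manually and leaves out the empty level 1 row, on the assumption that an unused level contributes nothing to the statistic (its expected counts are zero, so its terms would be 0/0).

```r
# Observed counts copied from the table above (level 1 omitted because it is empty)
observed <- matrix(
  c(88, 29,  0,
     0, 38, 46),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Short Life", "Long Life"),
    c("level 2", "level 3", "level 4")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq_by_hand <- sum((observed - expected)^2 / expected)
chi_sq_by_hand  # about 133.4, in line with the value returned by calculate()

# Base R's chisq.test() should give the same statistic
chi_sq_base <- unname(chisq.test(observed)$statistic)
chi_sq_base
```

The expected counts produced this way also match the non-empty columns of E_t_four_countries shown earlier.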

Four countries

# Null Distribution
null_distribution_four_countries <- selected_average_income_per_year %>%
  specify(life_expectancy_category ~ income_level_grouped) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

ggplot(null_distribution_four_countries, aes(x = stat)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Null Distribution of Chi-squared Statistics (Four Countries)") +
  theme_minimal()

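Shape of Null Distribution: Right-skewed. The permuted chi-squared statistics cannot be negative, so they pile up near zero and thin out in a long right tail.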

f. What is the p-value? Note that p-value is the probability of event that the null test statistic is more extreme than the observed value. What is the statistical conclusion for the hypothesis test?

## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in four chosen countries."
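The chunk that produced this conclusion is not shown in the rendered report. As a rough sketch of how a right-tailed permutation p-value could be obtained, the base R code below rebuilds one row per year from the observed counts, shuffles the life expectancy labels, and counts how often the shuffled statistic reaches the observed one. The seed, the helper name chi_sq, and the use of base R instead of infer are choices made for this illustration only.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Rebuild one row per year from the observed counts shown earlier
income <- rep(c("level 2", "level 3", "level 4"), times = c(88, 67, 46))
life <- c(
  rep("Short Life", 88),                        # level 2: 88 short, 0 long
  rep("Short Life", 29), rep("Long Life", 38),  # level 3: 29 short, 38 long
  rep("Long Life", 46)                          # level 4: 0 short, 46 long
)

# Hypothetical helper: chi-square statistic of the two-way table of x and y
chi_sq <- function(x, y) {
  unname(suppressWarnings(chisq.test(table(x, y))$statistic))
}

observed_stat <- chi_sq(life, income)

# Null distribution: shuffling life breaks any link with income
null_stats <- replicate(1000, chi_sq(sample(life), income))

# Right-tailed p-value: share of permuted statistics at least as large as observed
p_value <- mean(null_stats >= observed_stat)
p_value
```

When no permuted statistic reaches the observed value, the estimate is 0. With 1000 permutations this is better reported as p < 0.001 than as exactly zero.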

Belgium

## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in Belgium."

France

## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in France."

Ireland

## [1] "Reject the null hypothesis: There is a significant relationship between income level and life expectancy category in Ireland."

4. Hypothesis test for dependence between a numerical and a categorical variable

DATA GROUPING: Beyond testing each of the four selected countries, we also group them into a single grouped_four_countries variable for testing. This requires some extra steps to calculate their average fertility rates and average_daily_income. These steps are:

1. Filter the data to the years 1825 to 2025, select the columns year, country, babies_per_woman, average_daily_income_per_person, and income_level, then keep the four chosen countries.

library(dplyr)

filtered_data <- world_data %>%
  filter(year >= 1825 & year <= 2025) %>%
  select(year, country, babies_per_woman, average_daily_income_per_person, income_level)


selected_countries <- c("Germany", "Belgium", "France", "Ireland")

filtered_selected_countries <- filtered_data %>%
  filter(country %in% selected_countries)

filtered_selected_countries

## # A tibble: 804 × 5
##     year country babies_per_woman average_daily_income_per_person income_level
##    <dbl> <chr>              <dbl>                           <dbl> <chr>       
##  1  1825 Belgium             4.74                            3.70 level 2     
##  2  1826 Belgium             4.74                            3.72 level 2     
##  3  1827 Belgium             4.73                            3.74 level 2     
##  4  1828 Belgium             4.73                            3.76 level 2     
##  5  1829 Belgium             4.72                            3.78 level 2     
##  6  1830 Belgium             4.72                            3.80 level 2     
##  7  1831 Belgium             4.77                            3.82 level 2     
##  8  1832 Belgium             4.82                            3.84 level 2     
##  9  1833 Belgium             4.87                            3.86 level 2     
## 10  1834 Belgium             4.92                            3.88 level 2     
## # ℹ 794 more rows

2. Calculate the average of babies_per_woman across the four countries for each year:
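The chunk for this step does not appear in the rendered output, although a later step uses an object named avg_babies_per_woman_by_year with a column avg_babies_per_woman. A sketch of how it could be built, mirroring the income step and assuming the same na.rm = TRUE handling, is given here. The helper name average_babies_by_year and the toy rows are illustrative and not taken from the report.

```r
library(dplyr)

# Hypothetical helper: yearly mean of babies_per_woman across countries
average_babies_by_year <- function(data) {
  data %>%
    group_by(year) %>%
    summarize(avg_babies_per_woman = mean(babies_per_woman, na.rm = TRUE))
}

# Toy input with made-up values, only to show the shape of the result
toy <- tibble(
  year = c(1825, 1825, 1826, 1826),
  country = c("Belgium", "France", "Belgium", "France"),
  babies_per_woman = c(4.0, 5.0, 4.5, NA)
)
toy_result <- average_babies_by_year(toy)
toy_result

# In the report this would presumably be called as:
# avg_babies_per_woman_by_year <- average_babies_by_year(filtered_selected_countries)
```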

3. Calculate the average of average_daily_income_per_person across the four countries for each year:

# Calculate the average of average_daily_income_per_person for each year
avg_income_by_year <- filtered_selected_countries %>%
  group_by(year) %>%
  summarize(avg_daily_income = mean(average_daily_income_per_person, na.rm = TRUE))

avg_income_by_year

## # A tibble: 201 × 2
##     year avg_daily_income
##    <dbl>            <dbl>
##  1  1825             2.94
##  2  1826             2.98
##  3  1827             2.98
##  4  1828             3.00
##  5  1829             3.04
##  6  1830             3.04
##  7  1831             3.08
##  8  1832             3.17
##  9  1833             3.17
## 10  1834             3.20
## # ℹ 191 more rows

4. Sort the yearly average of average_daily_income of the four countries into levels 1, 2, 3, and 4, based on Gapminder’s sorting.

# Add 'permutated_level' column based on 'avg_daily_income' thresholds
selected_average_income_per_year <- avg_income_by_year %>%
  mutate(
    permutated_level = case_when(
      avg_daily_income <= 2 ~ "level 1",
      avg_daily_income <= 8 ~ "level 2",
      avg_daily_income <= 32 ~ "level 3",
      avg_daily_income < 128 ~ "level 4",
      TRUE ~ "level 5"
    )
  )


selected_average_income_per_year

## # A tibble: 201 × 3
##     year avg_daily_income permutated_level
##    <dbl>            <dbl> <chr>           
##  1  1825             2.94 level 2         
##  2  1826             2.98 level 2         
##  3  1827             2.98 level 2         
##  4  1828             3.00 level 2         
##  5  1829             3.04 level 2         
##  6  1830             3.04 level 2         
##  7  1831             3.08 level 2         
##  8  1832             3.17 level 2         
##  9  1833             3.17 level 2         
## 10  1834             3.20 level 2         
## # ℹ 191 more rows
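Because case_when() evaluates its conditions in order and keeps the first match, each income value receives exactly one level, and the boundary values 2, 8, and 32 fall into the lower of the two adjacent levels. The sketch below wraps the same thresholds in a hypothetical helper, classify_income, and applies it to a few made-up values to show the boundary behaviour.

```r
library(dplyr)

# Same thresholds as the case_when() above, wrapped in a hypothetical helper
classify_income <- function(income) {
  case_when(
    income <= 2   ~ "level 1",
    income <= 8   ~ "level 2",
    income <= 32  ~ "level 3",
    income < 128  ~ "level 4",
    TRUE          ~ "level 5"
  )
}

# Made-up incomes chosen to sit on or near each boundary
example_levels <- classify_income(c(2, 2.94, 8, 32, 127.9, 128))
example_levels
# "level 1" "level 2" "level 2" "level 3" "level 4" "level 5"
```

Note that the last boundary uses a strict inequality, so an income of exactly 128 is assigned to level 5 and is later excluded from the level 1 to level 4 grouping.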

Combine data together:

# Combine avg_babies_per_woman_by_year and selected_average_income_per_year by 'year'
combined_babies_income_table <- avg_babies_per_woman_by_year %>%
  left_join(selected_average_income_per_year, by = "year")

combined_babies_income_table

## # A tibble: 201 × 4
##     year avg_babies_per_woman avg_daily_income permutated_level
##    <dbl>                <dbl>            <dbl> <chr>           
##  1  1825                 4.50             2.94 level 2         
##  2  1826                 4.47             2.98 level 2         
##  3  1827                 4.44             2.98 level 2         
##  4  1828                 4.41             3.00 level 2         
##  5  1829                 4.38             3.04 level 2         
##  6  1830                 4.36             3.04 level 2         
##  7  1831                 4.37             3.08 level 2         
##  8  1832                 4.38             3.17 level 2         
##  9  1833                 4.40             3.17 level 2         
## 10  1834                 4.41             3.20 level 2         
## # ℹ 191 more rows

Calculate the average of avg_babies_per_woman for each permutated_level:

# Calculate the average of avg_babies_per_woman for each permutated_level
avg_babies_by_permutated_level <- combined_babies_income_table %>%
  group_by(permutated_level) %>%
  summarize(avg_babies_per_woman = mean(avg_babies_per_woman, na.rm = TRUE))

avg_babies_by_permutated_level

## # A tibble: 3 × 2
##   permutated_level avg_babies_per_woman
##   <chr>                           <dbl>
## 1 level 2                          4.02
## 2 level 3                          2.59
## 3 level 4                          1.73

a. State clearly the interesting hypotheses. What are related research papers, and your motivations to test these hypotheses.

Interesting Hypothesis: There is a relationship between income level and fertility rate in a European country.
Theoretical Background: Income level affects fertility decisions in France and Germany; high-earning females in Germany adapt fertility behavior more strongly due to economic incentives and childcare support differences, suggesting that fertility patterns respond to economic conditions. Ref: https://doi.org/10.2139/ssrn.3616728

Highly educated women have higher second birth risks in France and West Germany; stronger effect in France due to better work-family compatibility. This indicates that the income-fertility relationship varies across income groups depending on institutional support and economic resources. Ref: https://doi.org/10.4054/MPIDR-WP-2004-015

Motivation: Given this evidence that fertility behavior responds to economic status and varies systematically across income groups, we hypothesize that the mean number of babies per woman differs significantly across income levels in these four European countries, with economic factors playing a key role in shaping reproductive decisions throughout the 1825-2025 period.

b. Describe the chosen data, some main descriptive statistics (numbers, plots) related to the tested hypotheses. What are possible conclusions just based on these descriptive statistics?

Group of four countries

Description: The boxplot displays the distribution of babies per woman across income levels. Only Levels 2, 3, and 4 occur in the grouped data, with mean fertility of 4.02, 2.59, and 1.73 babies per woman respectively. We observe a trend where the lowest income level present (Level 2) tends to have a higher median fertility rate and wider variability than the higher income levels (Level 3, Level 4), which show lower and more consistent fertility rates. This suggests a potential negative relationship between income and fertility.

Germany

At Income Level 2, Germany showed a high median fertility rate of approximately 5.0 babies per woman with a wide distribution (range: ~4.3-5.5). Income Level 3 displayed a median of approximately 2.4 babies per woman with substantial variation (range: ~1.6-5.2). Income Level 4 demonstrated the lowest fertility with a median around 1.5 babies per woman and minimal variation (range: ~1.3-2.0).

Overall, Germany exhibits a clear negative association between income level and fertility, with fertility declining sharply as income increases.

Belgium

At Income Level 2, Belgium showed a median fertility rate of approximately 4.6 babies per woman with moderate variation (range: ~4.2-5.0). Income Level 3 displayed a median of approximately 2.5 babies per woman with wide dispersion (range: ~1.6-4.3). Income Level 4 demonstrated a low median fertility of approximately 1.6 babies per woman with tight clustering (range: ~1.4-1.8).

Overall, Belgium demonstrates a strong negative relationship between income and fertility, following a pattern similar to Germany with progressively declining birth rates at higher income levels.

library(ggplot2)

# Create a box plot of babies per woman by income level for Belgium
ggplot(belgium_data, aes(x = income_level, y = babies_per_woman)) +
  geom_boxplot(fill = "pink", color = "darkblue") +
  labs(
    title = "Babies per Woman in Belgium by Income Level",
    x = "Income Level",
    y = "Babies per Woman"
  ) +
  theme_minimal()

Ireland

library(ggplot2)

# Create a box plot of babies per woman by income level for Ireland
ggplot(ireland_data, aes(x = income_level, y = babies_per_woman)) +
  geom_boxplot(fill = "lightgreen", color = "darkblue") +
  labs(
    title = "Babies per Woman in Ireland by Income Level",
    x = "Income Level",
    y = "Babies per Woman"
  ) +
  theme_minimal()

At Income Level 1, Ireland showed a high median fertility rate of approximately 4.2 babies per woman with minimal variation. Income Level 2 displayed a median of approximately 3.2 babies per woman (range: ~2.6-4.3). Income Level 3 showed a median around 3.2 babies per woman with the widest variation (range: ~1.8-4.1). Income Level 4 demonstrated the lowest fertility with a median of approximately 1.9 babies per woman (range: ~1.6-2.1).

Overall, Ireland exhibits a negative income-fertility relationship, though at Levels 3 and 4 its median fertility remains higher than in Germany, Belgium, and France, reflecting Ireland’s distinct demographic and cultural context.

France

library(ggplot2)

# Create a box plot of babies per woman by income level for France
ggplot(france_data, aes(x = income_level, y = babies_per_woman)) +
  geom_boxplot(fill = "Purple", color = "darkblue") +
  labs(
    title = "Babies per Woman in France by Income Level",
    x = "Income Level",
    y = "Babies per Woman"
  ) +
  theme_minimal()

At Income Level 2, France displayed a median fertility rate of approximately 3.3 babies per woman with considerable variation (range: ~2.1-5.0, with outliers below 2.0). Income Level 3 showed a median of approximately 2.5 babies per woman (range: ~1.5-3.1). Income Level 4 demonstrated the lowest fertility with a median around 1.8 babies per woman and narrow distribution (range: ~1.7-2.0).

Overall, France shows a clear inverse relationship between income and fertility, with birth rates declining systematically as income increases, though France maintains slightly higher fertility at Level 4 compared to Germany and Belgium.

c. What are null, alternative hypothesis and significance levels?

Null hypothesis: The mean number of babies per woman is the same across all income levels. (There is no relationship between fertility and income level.)

Alternative hypothesis: At least one income level has a different mean number of babies per woman. (There is a relationship between fertility and income level.)
Significance Level: 0.05
Test direction: Right-tailed testing

d/e. What is your chosen test statistic? What is the value of the test statistic observed from the data?

Assume that the null hypothesis is true, write a R code to plot the null distribution. Describe the shape of the null distribution.

Test statistic: ANOVA F-statistic
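For reference, the one-way ANOVA F-statistic compares the variability between group means with the variability within groups. The notation below is introduced here for illustration and does not appear in the original report: $k$ is the number of income levels, $n_j$ the number of years in level $j$, $n$ the total number of years, $\bar{y}_j$ the mean fertility in level $j$, and $\bar{y}$ the overall mean.

```latex
F = \frac{\displaystyle \sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2 \Big/ (k - 1)}
         {\displaystyle \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2 \Big/ (n - k)}
```

Large values of $F$ mean the group means differ by more than the within-group spread would explain, which is why the test is right-tailed.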

Four countries:

library(infer)

# Specify the null hypothesis: no relationship between permutated_level and mean babies per woman
null_hypothesis_four_countries <- combined_babies_income_table %>%
  specify(avg_babies_per_woman ~ permutated_level) %>%
  hypothesize(null = "independence")

# Generate the null distribution with 1000 permutations
null_distribution_four_countries <- null_hypothesis_four_countries %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# Display the first few rows of the null distribution
head(null_distribution_four_countries)

## Response: avg_babies_per_woman (numeric)
## Explanatory: permutated_level (facto...
## # A tibble: 6 × 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 2.33  
## 2         2 0.973 
## 3         3 1.14  
## 4         4 1.06  
## 5         5 0.0573
## 6         6 0.533

hist(null_distribution_four_countries$stat)

Null distribution

# Calculate the observed statistic for combined_babies_income_table
observed_stat_four_countries <- combined_babies_income_table %>%
  specify(avg_babies_per_woman ~ permutated_level) %>%
  calculate(stat = "F")

observed_stat_four_countries

## Response: avg_babies_per_woman (numeric)
## Explanatory: permutated_level (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  600.

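Shape of null distribution: Right-skewed. The permuted F-statistics are non-negative and concentrated near small values, with a long right tail.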

f. What is the p-value? Note that p-value is the probability of event that the null test statistic is more extreme than the observed value. What is the statistical conclusion for the hypothesis test?

null_distribution_four_countries %>%
  visualize(fill = "black", color = "black") +
  shade_p_value(obs_stat = observed_stat_four_countries$stat, direction = "greater") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )
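The conclusion chunk that follows relies on an object named p_value_four_countries, whose definition is not shown in the rendered report. One plausible way to obtain such an object is infer's get_p_value(), which returns a one-row tibble with a p_value column. The sketch below demonstrates the call on made-up toy data so that it runs on its own; the toy values, seed, and object names are assumptions for illustration.

```r
library(infer)

set.seed(1)  # arbitrary seed for this sketch

# Toy data (made up): a numeric response and a three-level grouping variable
toy <- data.frame(
  babies = c(rnorm(30, mean = 4), rnorm(30, mean = 2.5), rnorm(30, mean = 1.7)),
  level = factor(rep(c("level 2", "level 3", "level 4"), each = 30))
)

# Observed F-statistic on the toy data
toy_observed <- toy %>%
  specify(babies ~ level) %>%
  calculate(stat = "F")

# Permutation null distribution of the F-statistic
toy_null <- toy %>%
  specify(babies ~ level) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate(stat = "F")

# Right-tailed p-value; the result is a tibble with a single p_value column
toy_p_value <- toy_null %>%
  get_p_value(obs_stat = toy_observed, direction = "greater")
toy_p_value

# In the report the equivalent call would presumably be:
# p_value_four_countries <- null_distribution_four_countries %>%
#   get_p_value(obs_stat = observed_stat_four_countries, direction = "greater")
```

When the observed statistic lies beyond every permuted value, get_p_value() reports 0 and warns that this is an approximation limited by the number of replicates, so the result is better stated as p < 1/reps.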

Significance level and conclusion

# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value_four_countries$p_value < alpha) {
  conclusion <- "Reject the null hypothesis: There is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4 in the group of four selected countries."
} else {
  conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that there is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4."
}

# Display the conclusion
conclusion

## [1] "Reject the null hypothesis: There is a significant difference in the means of expected babies per woman for income level 1, level 2, level 3 and level 4 in the group of four selected countries."

Introduction to Statistics: Final Project

Nguyen Chau Ngoc - Nguyen Quynh Nhi - Nguyen Van Trung - Dong Le Anh

2025-12-15
