Load required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Load the World Bank Dataset
# Treat '..' entries as missing values (NA) when reading the file.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c(".."))
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
## Time `Time Code` `Country Name` `Country Code` Region `Income Group`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2000 YR2000 Brazil BRA Latin America… Upper middle …
## 2 2000 YR2000 China CHN East Asia & P… Upper middle …
## 3 2000 YR2000 France FRA Europe & Cent… High income
## 4 2000 YR2000 Germany DEU Europe & Cent… High income
## 5 2000 YR2000 India IND South Asia Lower middle …
## 6 2000 YR2000 Indonesia IDN East Asia & P… Upper middle …
## 7 2000 YR2000 Italy ITA Europe & Cent… High income
## 8 2000 YR2000 Japan JPN East Asia & P… High income
## 9 2000 YR2000 Korea, Rep. KOR East Asia & P… High income
## 10 2000 YR2000 Mexico MEX Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## # `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## # `Unemployment, total (% of total labor force)` <dbl>,
## # `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## # `Population, total` <dbl>,
## # `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675 19
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time <dbl> 2000, 20…
## $ `Time Code` <chr> "YR2000"…
## $ `Country Name` <chr> "Brazil"…
## $ `Country Code` <chr> "BRA", "…
## $ Region <chr> "Latin A…
## $ `Income Group` <chr> "Upper m…
## $ `GDP (constant 2015 US$)` <dbl> 1.18642e…
## $ `GDP growth (annual %)` <dbl> 4.387949…
## $ `GDP (current US$)` <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)` <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)` <dbl> 7.044141…
## $ `Labor force, total` <dbl> 80295093…
## $ `Population, total` <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)` <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)` <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)` <dbl> 5.033917…
## $ `Gross savings (% of GDP)` <dbl> 13.99170…
## $ `Current account balance (% of GDP)` <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Clean column names
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
df <- world_bank |> clean_names()
glimpse(df)
## Rows: 1,675
## Columns: 19
## $ time <int> 2000, …
## $ time_code <chr> "YR200…
## $ country_name <chr> "Brazi…
## $ country_code <chr> "BRA",…
## $ region <chr> "Latin…
## $ income_group <chr> "Upper…
## $ gdp_constant_2015_us <dbl> 1.1864…
## $ gdp_growth_annual_percent <dbl> 4.3879…
## $ gdp_current_us <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent <dbl> 7.0441…
## $ labor_force_total <dbl> 802950…
## $ population_total <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp <dbl> 5.0339…
## $ gross_savings_percent_of_gdp <dbl> 13.991…
## $ current_account_balance_percent_of_gdp <dbl> -4.047…
Main Variable: gdp_growth_annual_percent (continuous)
Grouping Variable: income_group
Let:
Group A = High income countries
Group B = Middle & Low income countries
df$income_binary <- ifelse(df$income_group == "High income",
"High income",
"Non High income")
df$income_binary <- as.factor(df$income_binary)
table(df$income_binary)
##
## High income Non High income
## 600 1075
Research Question
Do high-income countries have different GDP growth rates than other countries?
\[H_0: \mu_{\text{High Income}} = \mu_{\text{Non-High Income}}\] \[H_1: \mu_{\text{High Income}} \neq \mu_{\text{Non-High Income}}\]
Two Sample t-test
α = 0.05
Reason: This is the conventional level in economics. A false positive (claiming a difference when none exists) is moderately costly, and a 5% risk of one is acceptable.
Power (1 − β) = 0.8
Reason: We want an 80% probability of detecting a meaningful difference if one exists.
Minimum Effect Size (Cohen’s d) = 0.3
Reason: A small-to-moderate difference in GDP growth (around 1 percentage point) is economically meaningful at the macro level. Even small growth differences compound over time.
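The compounding argument can be made concrete with a short arithmetic sketch. The 3% and 4% growth rates and the 20-year horizon are illustrative assumptions, not values taken from the dataset.

```r
# Illustrative growth rates (assumptions, not estimates from the data)
base_rate   <- 0.03   # 3% annual growth
faster_rate <- 0.04   # 4% annual growth, i.e. 1 percentage point higher
years       <- 20

# Output level after `years` of compounding, relative to a starting level of 1
level_base   <- (1 + base_rate)^years
level_faster <- (1 + faster_rate)^years

# How much larger the faster-growing economy is at the end of the horizon
relative_gap <- level_faster / level_base - 1
cat(sprintf("Relative gap after %d years: %.1f%%\n", years, 100 * relative_gap))
```

Under these assumptions the faster-growing economy ends up roughly one fifth larger, which is why a 1 percentage point gap is treated here as economically meaningful.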
library(pwr)
# Solve for the per-group sample size needed to detect d = 0.3
pwr_result <- pwr.t.test(
  d = 0.3,
  power = 0.8,
  sig.level = 0.05,
  type = "two.sample"
)
group_counts <- df |>
filter(!is.na(gdp_growth_annual_percent)) |>
count(income_binary)
print(group_counts)
## # A tibble: 2 × 2
## income_binary n
## <fct> <int>
## 1 High income 600
## 2 Non High income 1073
n_required <- ceiling(pwr_result$n)
cat("\nRequired per group:", n_required, "\n")
##
## Required per group: 176
cat("Do we have enough data?",
all(group_counts$n >= n_required), "\n")
## Do we have enough data? TRUE
Yes, we have enough data: the required sample size of 176 per group is well below both group sizes (600 and 1,073).
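As a cross-check that does not depend on the pwr package, the same calculation can be sketched with base R's `power.t.test()`. Setting `sd = 1` puts `delta` on the Cohen's d scale. The second call is a conservative approximation that assumes both groups are only as large as the smaller one (600), since `power.t.test()` expects equal group sizes.

```r
# Base-R cross-check of the required sample size (stats::power.t.test).
# With sd = 1, `delta` is on the same standardized scale as Cohen's d.
check <- power.t.test(delta = 0.3, sd = 1,
                      sig.level = 0.05, power = 0.8,
                      type = "two.sample", alternative = "two.sided")
n_check <- ceiling(check$n)
n_check

# Conservative achieved power: treat both groups as having the smaller
# group's size (600), which understates the information actually available.
achieved <- power.t.test(n = 600, delta = 0.3, sd = 1,
                         sig.level = 0.05, type = "two.sample")
achieved$power
```

If the achieved power is close to 1, a non-significant result at d = 0.3 would be very unlikely, so sample size is not the limiting factor in this test.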
t_test1 <- t.test(gdp_growth_annual_percent ~ income_binary,
data = df,
var.equal = FALSE)
t_test1
##
## Welch Two Sample t-test
##
## data: gdp_growth_annual_percent by income_binary
## t = -11.271, df = 1593.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High income and group Non High income is not equal to 0
## 95 percent confidence interval:
## -2.439632 -1.716378
## sample estimates:
## mean in group High income mean in group Non High income
## 2.565952 4.643957
The mean GDP growth rate is 2.57% for high-income countries and 4.64% for non-high-income countries, a difference of about 2.08 percentage points (95% CI: 1.72 to 2.44).
Because the p-value is far below the chosen significance level (α = 0.05), we reject the null hypothesis that the two groups have equal mean GDP growth rates.
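For readers who want to see what `var.equal = FALSE` does, here is a minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom computed by hand. It uses synthetic data, and the group sizes, means, and standard deviations are made-up assumptions rather than the World Bank values.

```r
set.seed(42)  # synthetic data only; not the World Bank values
x <- rnorm(60, mean = 2.5, sd = 3)    # stand-in for one group
y <- rnorm(100, mean = 4.5, sd = 5)   # stand-in for the other group

# Squared standard errors of each group mean (variances are not pooled)
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Compare with R's built-in Welch test
fit <- t.test(x, y, var.equal = FALSE)
all.equal(unname(fit$statistic), t_manual)
all.equal(unname(fit$parameter), df_manual)
```

The fractional degrees of freedom in the output above (df = 1593.5) come from this approximation, which is why they are not a whole number.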
library(effsize)
cohen.d(gdp_growth_annual_percent ~ income_binary, data = df)
##
## Cohen's d
##
## d estimate: -0.5206472 (medium)
## 95 percent confidence interval:
## lower upper
## -0.6221788 -0.4191155
This represents a medium effect size, indicating that the difference is not only statistically significant but also practically meaningful.
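To make the effect size less of a black box, the sketch below computes a pooled-standard-deviation Cohen's d by hand. The helper name `cohens_d_pooled` is hypothetical, and the sketch assumes the pooled-SD definition without a small-sample correction. On that assumption, dividing the mean difference of about -2.08 points by d ≈ -0.52 implies a pooled standard deviation of roughly 4 percentage points.

```r
# Hypothetical helper: pooled-SD Cohen's d for two numeric vectors
# (missing values are dropped before computing anything)
cohens_d_pooled <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Tiny worked example: the means differ by 1 and both SDs equal 1, so d = -1
d_example <- cohens_d_pooled(c(1, 2, 3), c(2, 3, 4))
d_example
```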
alpha <- 0.05
# Report very small p-values as an inequality instead of rounding them to 0
p_text <- if (t_test1$p.value < 0.0001) "p < 0.0001" else paste0("p = ", round(t_test1$p.value, 4))
result_text <- if (t_test1$p.value < alpha) {
  paste0("We **reject** the null hypothesis (", p_text, ", below α = ", alpha, ").")
} else {
  paste0("We **fail to reject** the null hypothesis (", p_text, ", not below α = ", alpha, ").")
}
cat(result_text)
## We **reject** the null hypothesis (p < 0.0001, below α = 0.05).
Because the difference is computed as high income minus non-high income, the negative sign of the test statistic and effect size indicates that high-income countries experience, on average, lower GDP growth rates than middle- and low-income countries. Economically, this finding is consistent with growth theory: developing economies often grow faster due to industrial expansion, capital accumulation, and structural transformation, while high-income economies tend to grow more slowly because they are already near the technological and productivity frontier.
library(ggplot2)
ggplot(df, aes(x = income_binary,
y = gdp_growth_annual_percent)) +
geom_boxplot() +
labs(title = "GDP Growth by Income Group",
x = "Income Group",
y = "GDP Growth (%)")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot shows that non-high-income countries have a higher median GDP growth rate than high-income countries. Non-high-income economies also display greater variability and more extreme growth outliers, indicating more volatile growth patterns. In contrast, high-income countries exhibit lower but more stable growth rates.
Is the gap changing over time?
Are non-high-income countries converging toward high-income growth rates, or is the gap stable across decades in this dataset?
Does the pattern hold within regions?
Is 2008 distorting the results?
(The Global Financial Crisis caused severe GDP contractions, particularly in high-income countries.)
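One way to probe the crisis question is a sensitivity check that reruns the Welch test after dropping selected years. The sketch below uses a toy data frame with the same column names as `df` but made-up values. The helper `welch_excluding` is hypothetical, and treating 2008 and 2009 as the crisis years is an assumption.

```r
# Toy stand-in for `df`: column names match the analysis, values are made up
set.seed(1)
toy <- data.frame(
  time = rep(2005:2012, each = 20),
  income_binary = factor(rep(c("High income", "Non High income"), times = 80))
)
toy$gdp_growth_annual_percent <- rnorm(
  nrow(toy),
  mean = ifelse(toy$income_binary == "High income", 2.5, 4.5),
  sd = 3
)

# Hypothetical helper: rerun the Welch test after dropping selected years
welch_excluding <- function(data, drop_years) {
  kept <- data[!(data$time %in% drop_years), ]
  t.test(gdp_growth_annual_percent ~ income_binary,
         data = kept, var.equal = FALSE)
}

full      <- welch_excluding(toy, drop_years = integer(0))
no_crisis <- welch_excluding(toy, drop_years = c(2008, 2009))

# Compare the group means with and without the assumed crisis years
rbind(full = full$estimate, no_crisis = no_crisis$estimate)
```

If the group means and confidence interval barely move when the crisis years are removed, the main result is unlikely to be driven by them. The same helper could be applied per decade or per region to explore the other questions.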
Main Variable: Binary, whether gdp_growth_annual_percent is at or above its median (“high growth” = success) or below it
Group A: Country-year observations with exports ≥ median exports share of GDP (“High Exports”)
Group B: Country-year observations with exports < median (“Low Exports”)
\[H_0: P(\text{high growth} \mid \text{high exports}) = P(\text{high growth} \mid \text{low exports})\] \[H_1: P(\text{high growth} \mid \text{high exports}) \neq P(\text{high growth} \mid \text{low exports})\]
median_exports <- median(df$exports_of_goods_and_services_percent_of_gdp,
na.rm = TRUE)
df$high_exports <- ifelse(
df$exports_of_goods_and_services_percent_of_gdp >= median_exports,
"High Exports",
"Low Exports"
)
df$high_exports <- as.factor(df$high_exports)
median_growth <- median(df$gdp_growth_annual_percent, na.rm = TRUE)
df$high_growth <- ifelse(
df$gdp_growth_annual_percent >= median_growth,
"High Growth",
"Low Growth"
)
contingency_table <- table(df$high_exports, df$high_growth)
print(contingency_table)
##
## High Growth Low Growth
## High Exports 372 425
## Low Exports 416 380
# Use chi-squared if all expected counts >= 5, else Fisher's Exact
expected_counts <- chisq.test(contingency_table)$expected
use_fisher <- any(expected_counts < 5)
if (use_fisher) {
cat("\nUsing Fisher's Exact Test (some expected counts < 5)\n")
h2_result <- fisher.test(contingency_table)
} else {
cat("\nUsing Chi-Squared Test\n")
h2_result <- chisq.test(contingency_table)
}
##
## Using Chi-Squared Test
print(h2_result)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 4.7508, df = 1, p-value = 0.02928
cat("p-value:", round(h2_result$p.value, 5), "\n\n")
## p-value: 0.02928
if (h2_result$p.value < 0.05) {
  cat("The p-value is below 0.05. Under Fisher's framework, this is moderate evidence against the null hypothesis.\n",
      "The data suggest that export intensity and GDP growth classification are NOT independent.\n", sep = "")
} else {
  cat("The p-value is at or above 0.05. Under Fisher's framework, the data do not provide strong evidence against the null.\n",
      "We cannot confidently claim export intensity is associated with GDP growth classification.\n", sep = "")
}
## The p-value is below 0.05. Under Fisher's framework, this is moderate evidence against the null hypothesis.
## The data suggest that export intensity and GDP growth classification are NOT independent.
The chi-squared test (with Yates’ continuity correction) indicates a statistically significant association between export intensity and GDP growth classification (χ²(1) = 4.75, p = 0.029). Under Fisher’s framework, a p-value below 0.05 counts as evidence against the null hypothesis of independence. High-export observations were somewhat less likely to show above-median growth (372 of 797, versus 416 of 796 for low-export observations), so the association is statistically detectable but modest in size, and it does not by itself establish a causal link between trade openness and economic performance.
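Since a p-value says nothing about magnitude, a rough effect-size sketch may help. The code below rebuilds the 2 × 2 table from the counts printed above and computes an odds ratio and a phi coefficient. The choice of these two measures is an assumption of this sketch, not part of the original analysis.

```r
# Counts copied from the contingency table printed above
tab <- matrix(c(372, 416, 425, 380), nrow = 2,
              dimnames = list(exports = c("High Exports", "Low Exports"),
                              growth  = c("High Growth", "Low Growth")))

# Odds of high growth for high exporters relative to low exporters
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Phi coefficient from the uncorrected chi-squared statistic
chi_uncorrected <- chisq.test(tab, correct = FALSE)$statistic
phi <- sqrt(unname(chi_uncorrected) / sum(tab))

round(c(odds_ratio = odds_ratio, phi = phi), 3)
```

An odds ratio of about 0.80 and a phi coefficient well below 0.1 both point to a weak association, even though it is statistically significant at this sample size.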
df_clean <- df[!is.na(df$high_exports) & !is.na(df$high_growth), ]
prop_table <- df_clean |>
group_by(high_exports, high_growth) |>
summarise(n = n(), .groups = "drop") |>
group_by(high_exports) |>
mutate(prop = n / sum(n))
ggplot(prop_table, aes(x = high_exports, y = prop, fill = high_growth)) +
geom_col() +
geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", size = 4) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Percentage of High Growth by Export Intensity",
x = "Export Group",
y = "Percentage",
fill = "Growth Category") +
theme_minimal()
The chart shows that 52.3% of low-export observations experienced high growth, compared with 46.7% of high-export observations, a gap of about 5.6 percentage points. Although the difference is modest, the chi-squared test confirms that it is statistically significant (p = 0.029). This suggests export intensity and GDP growth classification are related, though the effect appears small in magnitude.
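The two percentages can also be compared directly with a two-sample test of proportions, which additionally gives a confidence interval for the gap. This is a sketch using the counts from the contingency table above. `prop.test()` applies the Yates correction by default, so its p-value should match the chi-squared result.

```r
# High-growth counts and group sizes taken from the contingency table above
high_growth_n <- c(high_exports = 372, low_exports = 416)
group_n       <- c(high_exports = 372 + 425, low_exports = 416 + 380)

# Two-sample test of equal proportions (continuity correction on by default)
prop_result <- prop.test(x = high_growth_n, n = group_n)

round(unname(prop_result$estimate), 3)  # share of high growth in each group
prop_result$conf.int                    # 95% CI: high minus low exports
prop_result$p.value
```

Reporting the interval alongside the p-value makes it easier to judge whether the gap is large enough to matter economically.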
Is there a nonlinear threshold?
Is the relationship nonlinear? For example, do extremely high export shares reduce growth?
Does export composition (manufacturing vs commodities) matter?
Would the results change if we used mean growth instead of median-based classification?
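The nonlinearity question could be explored by replacing the single median split with export-share quartiles. The sketch below does this on synthetic data. The column names `exports_share` and `growth` and all values are made up for illustration, so the output says nothing about the real dataset.

```r
# Synthetic stand-in data; values are made up for illustration only
set.seed(7)
toy <- data.frame(exports_share = runif(400, min = 5, max = 120))
toy$growth <- rnorm(400, mean = 3, sd = 4)

# Same median-based definition of "high growth" as in the analysis
toy$high_growth <- toy$growth >= median(toy$growth, na.rm = TRUE)

# Split export share into quartiles instead of a single median cut
breaks <- quantile(toy$exports_share, probs = seq(0, 1, 0.25), na.rm = TRUE)
toy$export_quartile <- cut(toy$exports_share, breaks = breaks,
                           include.lowest = TRUE,
                           labels = c("Q1", "Q2", "Q3", "Q4"))

# Share of high-growth observations within each quartile
share_by_quartile <- tapply(toy$high_growth, toy$export_quartile, mean)
round(share_by_quartile, 3)

# A 4 x 2 table supports a chi-squared test with 3 degrees of freedom
quartile_test <- chisq.test(table(toy$export_quartile, toy$high_growth))
quartile_test
```

On the real data, a high-growth share that rises and then falls across quartiles would hint at a threshold effect that a median split cannot reveal.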