Data Dive

Week 7

Load required packages

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Load the World Bank Dataset

# Treat '..' entries as missing (NA) when reading the file
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c('..'))
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675   19
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Clean column names
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
df <- world_bank |> clean_names()
glimpse(df)
## Rows: 1,675
## Columns: 19
## $ time                                                            <int> 2000, …
## $ time_code                                                       <chr> "YR200…
## $ country_name                                                    <chr> "Brazi…
## $ country_code                                                    <chr> "BRA",…
## $ region                                                          <chr> "Latin…
## $ income_group                                                    <chr> "Upper…
## $ gdp_constant_2015_us                                            <dbl> 1.1864…
## $ gdp_growth_annual_percent                                       <dbl> 4.3879…
## $ gdp_current_us                                                  <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force                 <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent                        <dbl> 7.0441…
## $ labor_force_total                                               <dbl> 802950…
## $ population_total                                                <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp                    <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp                    <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp            <dbl> 5.0339…
## $ gross_savings_percent_of_gdp                                    <dbl> 13.991…
## $ current_account_balance_percent_of_gdp                          <dbl> -4.047…

Hypothesis 1 (Neyman-Pearson Framework)

Main Variable: gdp_growth_annual_percent (continuous)

Grouping Variable: income_group

Let:

Group A = High income countries

Group B = Middle & Low income countries

df$income_binary <- ifelse(df$income_group == "High income",
                           "High income",
                           "Non High income")

df$income_binary <- as.factor(df$income_binary)

table(df$income_binary)
## 
##     High income Non High income 
##             600            1075

Research Question

Do high-income countries have different GDP growth rates than other countries?

Null and Alternative Hypotheses

\[H_0: \mu_{\text{High Income}} = \mu_{\text{Non-High Income}}\] \[H_1: \mu_{\text{High Income}} \neq \mu_{\text{Non-High Income}}\]

Two Sample t-test

  • α = 0.05

    Reason : Standard in economics. False positive (claiming difference when none exists) is moderately costly but acceptable at 5%.

  • Power (1 − β) = 0.8

    Reason : We want 80% probability of detecting a meaningful difference.

  • Minimum Effect Size (Cohen’s d) = 0.3

    Reason : A small-to-moderate difference in GDP growth (around 1 percentage point) is economically meaningful at the macro level. Even small growth differences compound over time.
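
The compounding argument can be made concrete with a short calculation. The sketch below assumes two hypothetical economies that start at the same GDP level and grow at 2% and 3% per year; both rates and the 20-year horizon are illustrative assumptions, not values from the dataset.

```r
# Hypothetical growth rates and horizon (assumptions for illustration only)
slow_growth <- 0.02
fast_growth <- 0.03
years <- 20

# Relative GDP level after `years`, starting from the same base of 1
slow_level <- (1 + slow_growth)^years
fast_level <- (1 + fast_growth)^years

# How much larger the faster-growing economy ends up
level_ratio <- fast_level / slow_level
cat("Relative size after", years, "years:", round(level_ratio, 2), "\n")
```

Under these assumed rates, a one percentage point gap leaves the faster-growing economy roughly a fifth larger after two decades.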

Sample Size Calculation

library(pwr)
pwr_result <- pwr.t.test(d = 0.3,
                         power = 0.8,
                         sig.level = 0.05,
                         type = "two.sample")
group_counts <- df |>
  filter(!is.na(gdp_growth_annual_percent)) |>
  count(income_binary)

print(group_counts)
## # A tibble: 2 × 2
##   income_binary       n
##   <fct>           <int>
## 1 High income       600
## 2 Non High income  1073
n_required <- ceiling(pwr_result$n)
cat("\nRequired per group:", n_required, "\n")
## 
## Required per group: 176
cat("Do we have enough data?", 
    all(group_counts$n >= n_required), "\n")
## Do we have enough data? TRUE

Yes, we have enough data: the required sample size (176 per group) is well below the number of available observations in each group (600 and 1,073).
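
As a cross-check, base R's `power.t.test()` from the stats package solves the same problem when the effect is expressed as `delta / sd`. The sketch below sets `sd = 1` so that `delta` plays the role of Cohen's d, and also asks what effect size would be detectable with the smaller of the two groups. Treating both groups as having 600 observations is a simplifying assumption.

```r
# Required n per group for d = 0.3, power = 0.8, alpha = 0.05 (two-sided)
check <- power.t.test(delta = 0.3, sd = 1, power = 0.8, sig.level = 0.05,
                      type = "two.sample", alternative = "two.sided")
n_check <- ceiling(check$n)

# Smallest standardized effect detectable with 80% power at n = 600 per group
# (assumes balanced groups; the larger group actually has 1,073 observations)
mde <- power.t.test(n = 600, sd = 1, power = 0.8, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")$delta

cat("Required per group:", n_check, "\n")
cat("Detectable effect at n = 600:", round(mde, 3), "\n")
```

If the detectable effect is well below 0.3, the study has more than the planned 80% power for the minimum effect size chosen above.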

Perform t-test

t_test1 <- t.test(gdp_growth_annual_percent ~ income_binary,
                  data = df,
                  var.equal = FALSE)

t_test1
## 
##  Welch Two Sample t-test
## 
## data:  gdp_growth_annual_percent by income_binary
## t = -11.271, df = 1593.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High income and group Non High income is not equal to 0
## 95 percent confidence interval:
##  -2.439632 -1.716378
## sample estimates:
##     mean in group High income mean in group Non High income 
##                      2.565952                      4.643957

Mean GDP growth rates:

High-income countries: 2.57%

Non-high-income countries: 4.64%

Because the p-value is far below the chosen significance level (α = 0.05), we reject the null hypothesis that the two groups have equal mean GDP growth rates.
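
The reported statistics are internally consistent, which can be checked from the printed output alone. The sketch below takes the group means, confidence interval, and degrees of freedom as printed above and backs out the standard error and t statistic. It is a sanity check on the output, not a replacement for `t.test()`, and the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
mean_high     <- 2.565952
mean_non_high <- 4.643957
ci            <- c(-2.439632, -1.716378)
welch_df      <- 1593.5

# Point estimate of the difference (high minus non-high)
mean_diff <- mean_high - mean_non_high

# Half-width of the 95% CI equals qt(0.975, df) * standard error
std_error <- diff(ci) / 2 / qt(0.975, welch_df)

# Recovered t statistic; should agree with the printed t up to rounding
t_recovered <- mean_diff / std_error
cat("Difference:", round(mean_diff, 3), " SE:", round(std_error, 3),
    " t:", round(t_recovered, 2), "\n")
```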

Effect size (Cohen’s d)

library(effsize)

cohen.d(gdp_growth_annual_percent ~ income_binary, data = df)
## 
## Cohen's d
## 
## d estimate: -0.5206472 (medium)
## 95 percent confidence interval:
##      lower      upper 
## -0.6221788 -0.4191155

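This represents a medium effect size (|d| ≈ 0.52, 95% CI 0.42 to 0.62). The entire interval lies above the pre-specified minimum effect size of 0.3, indicating that the difference is not only statistically significant but also practically meaningful.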
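
To relate the standardized effect back to percentage points, the pooled standard deviation can be backed out from the reported mean difference and Cohen's d. The sketch below uses only the printed estimates, so the implied standard deviation is approximate.

```r
# Values copied from the output above
mean_diff <- 2.565952 - 4.643957   # high minus non-high, in percentage points
d_est     <- -0.5206472            # Cohen's d reported by effsize::cohen.d()
d_min     <- 0.3                   # pre-specified minimum effect size

# Cohen's d = mean difference / pooled SD, so pooled SD = difference / d
pooled_sd <- mean_diff / d_est

# The minimum effect size expressed in percentage points of GDP growth
min_diff_pp <- d_min * pooled_sd

cat("Implied pooled SD:", round(pooled_sd, 2), "percentage points\n")
cat("d = 0.3 corresponds to about", round(min_diff_pp, 2), "percentage points\n")
```

The result is in line with the "around 1 percentage point" used to justify the minimum effect size.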

Interpretation

alpha <- 0.05
# Report very small p-values as an inequality instead of rounding them to 0
p_text <- ifelse(t_test1$p.value < 1e-4,
  "p < 0.0001",
  paste0("p = ", round(t_test1$p.value, 4)))
result_text <- ifelse(t_test1$p.value < alpha,
  paste0("We **reject** the null hypothesis (", p_text, ", below α = ", alpha, ")."),
  paste0("We **fail to reject** the null hypothesis (", p_text, ", at or above α = ", alpha, ")."))
cat(result_text)
## We **reject** the null hypothesis (p < 0.0001, below α = 0.05).

The negative sign of the test statistic and effect size indicates that high-income countries experience, on average, lower GDP growth than middle- and low-income countries. Economically, this finding is consistent with growth theory: developing economies often grow faster due to industrial expansion, capital accumulation, and structural transformation, whereas high-income economies tend to grow more slowly because they are already near the technological and productivity frontier.

Visualization 1

library(ggplot2)

ggplot(df, aes(x = income_binary,
               y = gdp_growth_annual_percent)) +
  geom_boxplot() +
  labs(title = "GDP Growth by Income Group",
       x = "Income Group",
       y = "GDP Growth (%)")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The boxplot shows that non-high-income countries have a higher median GDP growth rate than high-income countries. Non-high-income economies also display greater variability and more extreme growth outliers, indicating more volatile growth patterns. In contrast, high-income countries exhibit lower but more stable growth rates.

Any Further Questions

Is the gap changing over time?

Are non-high-income countries converging toward high-income growth rates, or is the gap stable across decades in this dataset?

Does the pattern hold within regions?

Is 2008 distorting the results?

(The Global Financial Crisis caused severe GDP contractions, particularly in high-income countries.)
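
Two of these questions could be explored along the following lines. The sketch below uses a small simulated data frame (`toy`) as a stand-in for `df`, so every number it produces is synthetic. The choice of 2008 and 2009 as crisis years is also an assumption made for illustration.

```r
# Synthetic stand-in for `df`: simulated values, NOT World Bank data
set.seed(42)
toy <- expand.grid(time = 2000:2011, country = paste0("C", 1:8),
                   stringsAsFactors = FALSE)
toy$income_binary <- ifelse(toy$country %in% paste0("C", 1:4),
                            "High income", "Non High income")
toy$gdp_growth_annual_percent <- rnorm(
  nrow(toy),
  mean = ifelse(toy$income_binary == "High income", 2.5, 4.5),
  sd = 2
)

# Mean growth by year and group, then the yearly gap (non-high minus high)
yearly_means <- tapply(toy$gdp_growth_annual_percent,
                       list(toy$time, toy$income_binary), mean)
gap_by_year <- yearly_means[, "Non High income"] - yearly_means[, "High income"]

# Linear trend in the gap; a slope near zero suggests a stable gap
gap_trend <- lm(gap ~ year,
                data = data.frame(year = as.integer(names(gap_by_year)),
                                  gap = gap_by_year))
print(coef(gap_trend))

# Sensitivity check: rerun the Welch test without the assumed crisis years
crisis_years <- c(2008, 2009)  # assumption: which years count as the crisis
no_crisis <- subset(toy, !time %in% crisis_years)
print(t.test(gdp_growth_annual_percent ~ income_binary, data = no_crisis))
```

Applied to the real data, the same steps would use `df` in place of `toy`, and the regional question could be approached by repeating the test within each level of `region`.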

Hypothesis 2: Fisher’s Significance Testing

  • Main Variable: Binary indicator of whether gdp_growth_annual_percent is at or above its median (“high growth” = success)

  • Group A: Country-year observations with exports ≥ the median export share of GDP (“High Exports”)

  • Group B: Country-year observations with exports < the median (“Low Exports”)

Null and Alternative Hypotheses

\[H_0: P(\text{high growth} \mid \text{high exports}) = P(\text{high growth} \mid \text{low exports})\] \[H_1: P(\text{high growth} \mid \text{high exports}) \neq P(\text{high growth} \mid \text{low exports})\]

Groups

median_exports <- median(df$exports_of_goods_and_services_percent_of_gdp, 
                         na.rm = TRUE)

df$high_exports <- ifelse(
  df$exports_of_goods_and_services_percent_of_gdp >= median_exports,
  "High Exports",
  "Low Exports"
)

df$high_exports <- as.factor(df$high_exports)

Binary Growth Variable

median_growth <- median(df$gdp_growth_annual_percent, na.rm = TRUE)

df$high_growth <- ifelse(
  df$gdp_growth_annual_percent >= median_growth,
  "High Growth",
  "Low Growth"
)
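
Rows with a missing value in either variable receive an NA label and drop out of the cross-tabulation, which is why the contingency table below sums to 1,593 rather than 1,675. The sketch below illustrates that behaviour on a toy vector; the values are made up for illustration.

```r
# Toy vector with one missing value (illustrative, not from the dataset)
growth <- c(1.0, NA, 3.0, 5.0)

# The median ignores the NA because na.rm = TRUE
cutoff <- median(growth, na.rm = TRUE)

# ifelse() returns NA wherever the condition is NA
label <- ifelse(growth >= cutoff, "High Growth", "Low Growth")
print(label)

# table() omits NA by default; useNA = "ifany" makes the gap visible
print(table(label))
print(table(label, useNA = "ifany"))
```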

Check Counts

contingency_table <- table(df$high_exports, df$high_growth)
print(contingency_table)
##               
##                High Growth Low Growth
##   High Exports         372        425
##   Low Exports          416        380

Chi-Squared Test

# Use chi-squared if all expected counts >= 5, else Fisher's Exact
expected_counts <- chisq.test(contingency_table)$expected
use_fisher <- any(expected_counts < 5)

if (use_fisher) {
  cat("\nUsing Fisher's Exact Test (some expected counts < 5)\n")
  h2_result <- fisher.test(contingency_table)
} else {
  cat("\nUsing Chi-Squared Test\n")
  h2_result <- chisq.test(contingency_table)
}
## 
## Using Chi-Squared Test
print(h2_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 4.7508, df = 1, p-value = 0.02928
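
The test can be reproduced from the printed counts alone, which also makes the expected-count check explicit. The sketch below rebuilds the 2 × 2 table by hand (the object names are illustrative) and runs Fisher's exact test alongside for comparison.

```r
# Counts copied from the contingency table above
counts <- matrix(c(372, 425,
                   416, 380),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(exports = c("High Exports", "Low Exports"),
                                 growth  = c("High Growth", "Low Growth")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 1))

# Same test as above (Yates' correction is the default for 2 x 2 tables)
res <- chisq.test(counts)
print(res)

# Fisher's exact test on the same table, for comparison
print(fisher.test(counts))
```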

Interpretation

cat("p-value:", round(h2_result$p.value, 5), "\n\n")
## p-value: 0.02928
if (h2_result$p.value < 0.05) {
  cat("The p-value is below 0.05. Under Fisher's framework, this is evidence against the null hypothesis, with smaller p-values indicating stronger evidence.\n",
      "The data suggest that export intensity and GDP growth classification are NOT independent.\n")
} else {
  cat("The p-value is above 0.05. Under Fisher's framework, the data do not provide strong evidence against the null.\n",
      "We cannot confidently claim export intensity is associated with GDP growth classification.\n")
}
## The p-value is below 0.05. Under Fisher's framework, this is evidence against the null hypothesis, with smaller p-values indicating stronger evidence.
##  The data suggest that export intensity and GDP growth classification are NOT independent.

The chi-squared test indicates a statistically significant association between export intensity and GDP growth classification (χ²(1) = 4.75, p = 0.029). Under Fisher’s framework, the p-value is read as a measure of evidence against the null hypothesis of independence, and a value of 0.029 is moderate evidence rather than a formal accept/reject decision. High-export observations were less likely than low-export observations to show above-median growth (372 of 797 versus 416 of 796), so the association between trade openness and growth classification is negative and modest in size.

Visualization 2: Export vs GDP Growth

df_clean <- df[!is.na(df$high_exports) & !is.na(df$high_growth), ]
prop_table <- df_clean |>
  group_by(high_exports, high_growth) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(high_exports) |>
  mutate(prop = n / sum(n))

ggplot(prop_table, aes(x = high_exports, y = prop, fill = high_growth)) +
  geom_col() +
  geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 4) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Percentage of High Growth by Export Intensity",
       x = "Export Group",
       y = "Percentage",
       fill = "Growth Category") +
  theme_minimal()

The chart shows that 52.3% of low-export country-year observations experienced high growth, compared to 46.7% of high-export observations. Although the difference is modest (about 5.6 percentage points), the chi-squared test indicates that it is statistically significant (p = 0.029). This suggests export intensity and GDP growth classification are related, though the effect appears small in magnitude.
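
The size of the effect can be quantified from the printed counts. The sketch below computes the difference in proportions with a confidence interval, the odds ratio, and a phi coefficient. Deriving phi from the Yates-corrected statistic is an approximation, and the object names are illustrative.

```r
# Counts copied from the contingency table above
high_growth <- c(high_exports = 372, low_exports = 416)
totals      <- c(high_exports = 372 + 425, low_exports = 416 + 380)

# Share of high-growth observations in each export group
shares <- high_growth / totals

# Difference in proportions with a 95% CI (prop.test applies Yates' correction)
prop_result <- prop.test(high_growth, totals)

# Odds ratio: odds of high growth for high exporters relative to low exporters
odds <- shares / (1 - shares)
odds_ratio <- unname(odds["high_exports"] / odds["low_exports"])

# Phi coefficient from the reported chi-squared statistic (approximate)
phi <- sqrt(4.7508 / sum(totals))

cat("Shares (%):", round(shares * 100, 1), "\n")
cat("Odds ratio:", round(odds_ratio, 2), " Phi:", round(phi, 3), "\n")
print(prop_result$conf.int)
```

An odds ratio close to 1 and a phi coefficient well below 0.1 would both support reading the association as small.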

Any Further Questions

Is there a nonlinear threshold?

Do extremely high export shares reduce growth?

Does export composition (manufacturing vs commodities) matter?

Would the results change if we used mean growth instead of median-based classification?