Statistics-Final Project-World Bank

Load required packages

library(readr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(janitor)
library(effsize)
library(scales)
library(broom)

Load the World Bank Dataset

# Treat the World Bank placeholder '..' as NA so numeric columns parse as numbers.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv", na = c('..'))
world_bank

## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …

dim(world_bank)

## [1] 1675   19

# Check column data types
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)

# Clean column names
df <- world_bank |> clean_names()
glimpse(df)

## Rows: 1,675
## Columns: 19
## $ time                                                            <int> 2000, …
## $ time_code                                                       <chr> "YR200…
## $ country_name                                                    <chr> "Brazi…
## $ country_code                                                    <chr> "BRA",…
## $ region                                                          <chr> "Latin…
## $ income_group                                                    <chr> "Upper…
## $ gdp_constant_2015_us                                            <dbl> 1.1864…
## $ gdp_growth_annual_percent                                       <dbl> 4.3879…
## $ gdp_current_us                                                  <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force                 <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent                        <dbl> 7.0441…
## $ labor_force_total                                               <dbl> 802950…
## $ population_total                                                <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp                    <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp                    <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp            <dbl> 5.0339…
## $ gross_savings_percent_of_gdp                                    <dbl> 13.991…
## $ current_account_balance_percent_of_gdp                          <dbl> -4.047…

1. Audience

This analysis is designed for international economic policymakers and global financial organizations (e.g., World Bank analysts or economic advisors) who are interested in understanding how economic growth patterns differ across countries, income groups and regions.

The goal is to support data-driven decisions related to economic development strategies and investment prioritization.

2. Objective

The primary objective of this project is to analyze the relationship between economic growth, GDP size, and trade patterns across countries.

Specifically, this project aims to answer:

How does GDP growth vary across income groups?
Do wealthier countries grow differently than developing economies?
How do trade indicators like exports relate to economic performance?
How do unemployment and macroeconomic indicators correlate with income level?

The ultimate goal is to derive insights that can inform economic policy and development strategies.

3. Data Description

The dataset is sourced from the World Bank’s World Development Indicators and includes country-level economic metrics over time.

Key variables used in this analysis include:

GDP (constant 2015 US$)
GDP growth (annual %)
Population, total
Exports of goods and services (% of GDP)
Unemployment, total (% of total labor force)
Inflation, consumer prices (annual %)
Gross savings (% of GDP)

The dataset contains 1,675 country-year observations (67 countries appear in the 2024 slice) across four income groups, allowing for cross-sectional and time-series analysis of global economic trends.

Countries under different Income Groups

df |>
  filter(time == 2024L) |>   # time is an integer column, so compare with an integer
  arrange(income_group, country_name) |>
  select(income_group, country_name)

## # A tibble: 67 × 2
##    income_group country_name
##    <chr>        <chr>       
##  1 High income  Australia   
##  2 High income  Bulgaria    
##  3 High income  Canada      
##  4 High income  Chile       
##  5 High income  Costa Rica  
##  6 High income  Finland     
##  7 High income  France      
##  8 High income  Germany     
##  9 High income  Israel      
## 10 High income  Italy       
## # ℹ 57 more rows

4. Exploratory Data Analysis

4.1. GDP Level vs Growth (Scatter Plot)

This scatterplot compares each country’s average GDP (constant 2015 US$) with its average annual GDP growth over the sample period, colored by income group and sized by average population.

# Prepare data for the scatter plot: per-country means across all years
scatter_data <- df |>
  group_by(country_name, income_group) |>
  summarise(
    Avg_GDP_Growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    GDP_Constant_2015 = mean(gdp_constant_2015_us, na.rm = TRUE),
    Population = mean(population_total, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(scatter_data, aes(x = Avg_GDP_Growth, y = GDP_Constant_2015, color = income_group, size = Population)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "GDP Level vs Average GDP Growth",
    x = "Average GDP Growth (Annual %)",
    y = "GDP (Constant 2015 US$)",
    size = "Population",
    color = "Income Group"
  ) +
  theme_minimal()

Insight:

High-income countries (e.g., United States) dominate in total GDP but tend to exhibit lower growth rates, while lower-income countries often show higher growth. This suggests a potential convergence effect, where developing economies grow faster than developed ones.

4.2. GDP Growth Trends Over Time (Line Chart)

This line chart shows how GDP growth has evolved over time across different income groups.

# Prepare data: average GDP growth per year per income group
line_data <- df |>
  group_by(time, income_group) |>
  summarise(
    avg_gdp_growth = mean(gdp_growth_annual_percent, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(line_data, aes(x = time, y = avg_gdp_growth, color = income_group)) +
  geom_line(linewidth = 1) +   # `size` for lines is deprecated since ggplot2 3.4.0
  labs(
    title = "GDP Growth Trends Over Time by Income Group",
    x = "Year",
    y = "Average GDP Growth (%)",
    color = "Income Group"
  ) +
  theme_minimal()

Insight:

The line chart displays the trend in average annual GDP growth, where high-income countries consistently exhibit the lowest growth rates, while low-income countries show the highest. Despite these differences, the overall downward trend across all income groups suggests global economic growth has slowed over time, particularly among more developed economies. A noticeable dip around 2020 across all groups indicates a global economic shock affecting all economies.

4.3. Export Patterns by Income Group (Bar Chart)

This bar chart compares the average exports (% of GDP) across income groups.

bar_data <- df |>
  ungroup() |>
  group_by(income_group) |>
  summarise(
    avg_exports = mean(exports_of_goods_and_services_percent_of_gdp, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(bar_data, aes(x = reorder(income_group, avg_exports), 
                     y = avg_exports, 
                     fill = income_group)) +
  geom_col() +
  geom_text(aes(label = round(avg_exports, 2)),  
            vjust = -0.5,                        
            size = 3.5) +
  labs(
    title = "Average Exports (% of GDP) by Income Group",
    x = "Income Group",
    y = "Average Exports (% of GDP)",
    fill = "Income Group"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Insight:

High-income countries have the highest export share, indicating strong global trade integration. Upper middle-income countries also show significant export activity, while lower middle-income countries lag behind. Low-income countries have the lowest export percentages, suggesting limited participation in international trade. This highlights a clear gap in trade capacity across income groups.

4.4. Unemployment Variation by Income Group

ggplot(df, aes(x = income_group, 
               y = unemployment_total_percent_of_total_labor_force, 
               fill = income_group)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Unemployment Distribution by Income Group",
    x = "Income Group",
    y = "Unemployment (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Insight:

High-income countries are mostly clustered around about 6% unemployment, but they also have some high outliers above 25%, showing rare but severe spikes in joblessness despite overall stability.

Low-income countries have the lowest median unemployment (around 3.5%), which is likely because many people work in informal or subsistence jobs that are not counted in official unemployment rates.

Upper-middle-income countries show the widest range (about 3% to 11%), meaning unemployment varies a lot across them, likely due to ongoing economic and structural changes.

4.5. Inflation Variation by Income Group

ggplot(df, aes(x = income_group, 
               y = inflation_consumer_prices_annual_percent, 
               fill = income_group)) +
  geom_boxplot(alpha = 0.7) +
  coord_cartesian(ylim = c(0, 30)) +   # adjust range as needed
  labs(
    title = "Inflation Distribution by Income Group",
    x = "Income Group",
    y = "Inflation (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Insights:

High-income countries have the lowest and most stable inflation, with a median around 2–3%, reflecting strong central banks and stable monetary policies.

Low-income and lower-middle-income countries both have higher median inflation (around 6%) and large upper outliers reaching up to 30%+, showing they are more exposed to shocks, currency fluctuations, and weaker policy control.

Upper-middle-income countries fall in between, with a moderate median inflation (~5%) and a more contained spread, suggesting improving price stability but still not as steady as high-income countries.

Assumptions

The World Bank WDI dataset is assumed to be reliable and consistently collected across countries, though minor reporting differences may exist.
Missing values are assumed to be random or limited enough that they do not substantially bias the overall analysis, although some distortion is still possible.
World Bank income group classifications are assumed to be a reasonable way to represent a country’s level of economic development, even though countries within the same group can still be quite different.
Averaging values across years is assumed to provide a meaningful representation of long-term structural trends, while smoothing short-term fluctuations.
The analysis includes only countries with sufficiently complete data for key numerical variables; this selection may introduce sample bias toward better-documented or higher-income countries.
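
The missing-data assumption can be checked rather than taken on faith. A minimal sketch, using a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the WDI file), of how the share of missing values per column and per income group might be tabulated in base R:

```r
# Toy data standing in for the cleaned WDI frame (values are made up)
toy <- data.frame(
  income_group = c("High income", "High income", "Low income", "Low income"),
  gdp_growth   = c(2.1, NA, 5.0, 4.2),
  unemployment = c(NA, 6.0, NA, 3.5)
)

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Share of missing unemployment values within each income group
missing_by_group <- aggregate(
  is.na(toy$unemployment),
  by  = list(income_group = toy$income_group),
  FUN = mean
)
names(missing_by_group)[2] <- "share_missing"
print(missing_by_group)
```

On the real data the same idea would be `colMeans(is.na(df))`. Markedly different missing shares across income groups would be a sign that the "missing at random" assumption deserves more scrutiny.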

5. Analysis and Support

5.1 Research Question 1:

Do high-income countries have different GDP growth rates than other countries?

Hypothesis Test (Neyman-Pearson Framework)

Main Variable: gdp_growth_annual_percent (continuous)

Grouping Variable: income_group

Let:

Group A = High income countries

Group B = Middle & Low income countries

df$income_binary <- ifelse(df$income_group == "High income",
                           "High income",
                           "Non High income")

df$income_binary <- as.factor(df$income_binary)

table(df$income_binary)

## 
##     High income Non High income 
##             600            1075

Null and Alternative Hypotheses

\[H_0: \mu_{\text{High Income}} = \mu_{\text{Non-High Income}}\] \[H_1: \mu_{\text{High Income}} \neq \mu_{\text{Non-High Income}}\]

Two Sample t-test

α = 0.05

Reason: Standard in economics. A false positive (claiming a difference when none exists) is moderately costly but acceptable at 5%.

Power (1 − β) = 0.8

Reason: We want an 80% probability of detecting a meaningful difference.

Minimum Effect Size (Cohen’s d) = 0.3

Reason: A small-to-moderate difference in GDP growth (around 1 percentage point) is economically meaningful at the macro level. Even small growth differences compound over time.
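
The three planning values (α, power, minimum effect size) together imply a minimum sample size per group. A rough sketch using base R's `power.t.test()`; it assumes equal variances and independent observations, which is a simplification here because the rows are repeated yearly measurements of the same countries:

```r
# Planning inputs taken from the text above
alpha <- 0.05
power <- 0.80
d_min <- 0.3   # minimum effect size in standard-deviation units

# With sd = 1, delta is expressed directly in Cohen's d units
plan <- power.t.test(delta = d_min, sd = 1, sig.level = alpha, power = power,
                     type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(plan$n)
print(n_per_group)

# Group sizes reported by table(df$income_binary)
observed_n <- c(high_income = 600, non_high_income = 1075)
print(observed_n >= n_per_group)
```

Both groups are well above the required size under these assumptions, so the test is not underpowered for d = 0.3.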

Perform t-test

t_test1 <- t.test(gdp_growth_annual_percent ~ income_binary,
                  data = df,
                  var.equal = FALSE)

t_test1

## 
##  Welch Two Sample t-test
## 
## data:  gdp_growth_annual_percent by income_binary
## t = -11.271, df = 1593.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High income and group Non High income is not equal to 0
## 95 percent confidence interval:
##  -2.439632 -1.716378
## sample estimates:
##     mean in group High income mean in group Non High income 
##                      2.565952                      4.643957

Mean GDP growth rate by group:

High-income countries: 2.57%

Non-high-income countries: 4.64%

Because the p-value is far below the chosen significance level (α = 0.05), we reject the null hypothesis that the two groups have equal mean GDP growth rates.
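
For readers who want to see what `var.equal = FALSE` changes, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand. The sketch below uses simulated numbers (the group sizes, means and standard deviations are arbitrary assumptions, not WDI values) and compares the manual result with `t.test()`:

```r
set.seed(123)
# Simulated growth rates (illustrative only; not WDI data)
high     <- rnorm(60,  mean = 2.5, sd = 3)
non_high <- rnorm(100, mean = 4.5, sd = 4.5)

# Variance of each group mean
v1 <- var(high) / length(high)
v2 <- var(non_high) / length(non_high)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(high) - mean(non_high)) / sqrt(v1 + v2)
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(high) - 1) + v2^2 / (length(non_high) - 1))

# Compare with the built-in test
welch <- t.test(high, non_high, var.equal = FALSE)
print(c(t_manual = t_manual, t_builtin = unname(welch$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(welch$parameter)))
```

The fractional degrees of freedom in the report's output (df = 1593.5) come from this approximation.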

Effect size (Cohen’s d)

cohen.d(gdp_growth_annual_percent ~ income_binary, data = df)

## 
## Cohen's d
## 
## d estimate: -0.5206472 (medium)
## 95 percent confidence interval:
##      lower      upper 
## -0.6221788 -0.4191155

With |d| ≈ 0.52, this is a medium effect size that exceeds the pre-specified minimum of 0.3, indicating that the difference is not only statistically significant but also practically meaningful.
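
As a sketch of what the effect size measures, Cohen’s d with a pooled standard deviation can be written in a few lines. The helper `cohens_d()` is a hypothetical name used for illustration and is not the `effsize` implementation. The second half back-calculates the pooled SD implied by the reported output, which gives a feel for what d = 0.3 means in percentage points:

```r
# Cohen's d with a pooled standard deviation (classic two-sample form)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy check: means 3 and 5, both variances 2.5
d_toy <- cohens_d(c(1, 2, 3, 4, 5), c(3, 4, 5, 6, 7))
print(d_toy)

# Back-of-envelope from the reported output: pooled SD implied by d and the mean gap
mean_gap   <- 2.565952 - 4.643957
implied_sd <- mean_gap / -0.5206472
print(implied_sd)        # roughly 4 percentage points
print(0.3 * implied_sd)  # growth gap corresponding to d = 0.3
```

The implied gap for d = 0.3 is a little over one percentage point, in line with the "around 1 percentage point" justification given for the minimum effect size.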

Interpretation

alpha <- 0.05
result_text <- ifelse(t_test1$p.value < alpha,
  paste0("We **reject** the null hypothesis (p = ", round(t_test1$p.value, 4), " < α = ", alpha, ")."),
  paste0("We **fail to reject** the null hypothesis (p = ", round(t_test1$p.value, 4), " ≥ α = ", alpha, ")."))
cat(result_text)

## We **reject** the null hypothesis (p = 0 < α = 0.05).

The negative sign of the test statistic and effect size indicates that high-income countries experience, on average, lower GDP growth rates than middle- and low-income countries. Economically, this finding is consistent with growth theory, as developing economies often grow faster due to industrial expansion, capital accumulation, and structural transformation. High-income economies tend to grow more slowly because they are already near the technological and productivity frontier.
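
The `p = 0` in the printed sentence above is a rounding artifact: `round(p, 4)` turns any p-value below 0.00005 into zero, whereas the t-test output reports p < 2.2e-16. A minimal sketch of one alternative, base R's `format.pval()`; `p_small` is a stand-in value chosen for illustration, not the actual test result:

```r
# A p-value far below the rounding precision (stand-in for t_test1$p.value)
p_small <- 1e-20

# round() collapses it to zero
print(round(p_small, 4))

# format.pval() reports it as being below a chosen floor instead
p_label <- format.pval(p_small, digits = 3, eps = 1e-4)
print(p_label)
```

Reporting the value as "below a floor" avoids suggesting that the p-value is exactly zero.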

Visualization 1 :

ggplot(df, aes(x = income_binary,
               y = gdp_growth_annual_percent)) +
  geom_boxplot() +
  labs(title = "GDP Growth by Income Group",
       x = "Income Group",
       y = "GDP Growth (%)")

The boxplot shows that non-high-income countries have a higher median GDP growth rate than high-income countries. Non-high-income economies also display greater variability and more extreme growth outliers, indicating more volatile growth patterns. In contrast, high-income countries exhibit lower but more stable growth rates.

5.2 Research Question 2 :

Is export intensity associated with the likelihood of achieving above-median GDP growth across countries?

Hypothesis Test: Fisher’s Significance Testing

Main Variable: Binary indicator of whether gdp_growth_annual_percent is at or above its median (“high growth” = success)
Group A: Country-year observations with exports ≥ median export share of GDP (“High Exports”)
Group B: Country-year observations with exports < median (“Low Exports”)

Null and Alternative Hypotheses

\[H_0: P(\text{high growth} \mid \text{high exports}) = P(\text{high growth} \mid \text{low exports})\] \[H_1: P(\text{high growth} \mid \text{high exports}) \neq P(\text{high growth} \mid \text{low exports})\]

Groups

median_exports <- median(df$exports_of_goods_and_services_percent_of_gdp, 
                         na.rm = TRUE)

df$high_exports <- ifelse(
  df$exports_of_goods_and_services_percent_of_gdp >= median_exports,
  "High Exports",
  "Low Exports"
)

df$high_exports <- as.factor(df$high_exports)

Binary Growth Variable

median_growth <- median(df$gdp_growth_annual_percent, na.rm = TRUE)

df$high_growth <- ifelse(
  df$gdp_growth_annual_percent >= median_growth,
  "High Growth",
  "Low Growth"
)

Check Counts

contingency_table <- table(df$high_exports, df$high_growth)
print(contingency_table)

##               
##                High Growth Low Growth
##   High Exports         372        425
##   Low Exports          416        380

Chi-Squared Test

# Use chi-squared if all expected counts >= 5, else Fisher's Exact
expected_counts <- chisq.test(contingency_table)$expected
use_fisher <- any(expected_counts < 5)

if (use_fisher) {
  cat("\nUsing Fisher's Exact Test (some expected counts < 5)\n")
  h2_result <- fisher.test(contingency_table)
} else {
  cat("\nUsing Chi-Squared Test\n")
  h2_result <- chisq.test(contingency_table)
}

## 
## Using Chi-Squared Test

print(h2_result)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 4.7508, df = 1, p-value = 0.02928

Interpretation

cat("p-value:", round(h2_result$p.value, 5), "\n\n")

## p-value: 0.02928

if (h2_result$p.value < 0.05) {
  cat("The p-value is below 0.05. Under Fisher's framework, this is strong evidence against the null hypothesis.\n",
      "The data suggest that export intensity and GDP growth classification are NOT independent.\n")
} else {
  cat("The p-value is above 0.05. Under Fisher's framework, the data do not provide strong evidence against the null.\n",
      "We cannot confidently claim export intensity is associated with GDP growth classification.\n")
}

## The p-value is below 0.05. Under Fisher's framework, this is strong evidence against the null hypothesis.
##  The data suggest that export intensity and GDP growth classification are NOT independent.

The Chi-squared test indicates a statistically significant association between export intensity and GDP growth classification (χ²(1) = 4.75, p = 0.029). Under Fisher’s framework, a p-value of this size counts as evidence against the null hypothesis of independence. The association is modest, and its direction runs against the usual trade-openness expectation: in the contingency table, low-export observations are somewhat more likely to show above-median growth (416 of 796) than high-export observations (372 of 797).

Visualization 2 : Export vs GDP Growth

df_clean <- df[!is.na(df$high_exports) & !is.na(df$high_growth), ]
prop_table <- df_clean |>
  group_by(high_exports, high_growth) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(high_exports) |>
  mutate(prop = n / sum(n))

ggplot(prop_table, aes(x = high_exports, y = prop, fill = high_growth)) +
  geom_col() +
  geom_text(aes(label = paste0(round(prop * 100, 1), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 4) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Percentage of High Growth by Export Intensity",
       x = "Export Group",
       y = "Percentage",
       fill = "Growth Category") +
  theme_minimal()

The chart shows that 52.3% of low-export observations fall in the high-growth category, compared to 46.7% of high-export observations. Although the difference is modest (about 5.6 percentage points), the Chi-squared test indicates that it is statistically significant (p = 0.029). Export intensity and GDP growth classification therefore appear related, but the effect is small in magnitude.
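
Because the contingency table is printed in full, the test can be reproduced without the raw data, and simple effect-size summaries can be added. The sketch below rebuilds the table from the printed counts; the odds ratio and phi coefficient are additions for illustration and were not part of the original analysis:

```r
# Counts copied from the printed contingency table
tab <- matrix(c(372, 416, 425, 380), nrow = 2,
              dimnames = list(exports = c("High Exports", "Low Exports"),
                              growth  = c("High Growth", "Low Growth")))

# Same test as in the report (Yates' correction is the default for 2x2 tables)
chi <- chisq.test(tab)
print(chi)

# Share of high-growth observations in each export group
high_growth_share <- tab[, "High Growth"] / rowSums(tab)
print(round(100 * high_growth_share, 1))

# Odds ratio and phi coefficient as simple effect-size summaries
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
phi <- sqrt(unname(chisq.test(tab, correct = FALSE)$statistic) / sum(tab))
print(c(odds_ratio = odds_ratio, phi = phi))
```

An odds ratio near 0.8 and a phi coefficient well below 0.1 are consistent with a statistically detectable but small association.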

5.3 Research Question 3:

Is there a relationship between gross savings (as a percentage of GDP) and GDP growth across countries?

Linear Regression Model

# Use only the latest data year (2024); time is an integer column
wdi_clean <- df |>
  filter(time == 2024L) |>
  select(
    country_name,
    income_group,
    gdp_growth_annual_percent,
    exports_of_goods_and_services_percent_of_gdp,
    inflation_consumer_prices_annual_percent,
    population_total,
    gross_savings_percent_of_gdp
  ) |>
  drop_na()

lm_model <- lm(
  gdp_growth_annual_percent ~ gross_savings_percent_of_gdp,
  data = wdi_clean
)

summary(lm_model)

## 
## Call:
## lm(formula = gdp_growth_annual_percent ~ gross_savings_percent_of_gdp, 
##     data = wdi_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7063 -1.5161  0.0876  1.3936  6.1868 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   0.58657    0.86787   0.676  0.50195   
## gross_savings_percent_of_gdp  0.10046    0.03469   2.896  0.00541 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.305 on 55 degrees of freedom
## Multiple R-squared:  0.1323, Adjusted R-squared:  0.1166 
## F-statistic: 8.389 on 1 and 55 DF,  p-value: 0.005409

Model Equation

Linear Regression Form

\[ \text{GDP Growth}_i = 0.5866 + 0.1005 \cdot (\text{Gross Savings}_i) + \epsilon_i \]

Interpretation

The linear regression model examines the relationship between gross savings (% of GDP) and GDP growth. The intercept (0.587) represents the predicted GDP growth when the gross savings rate is zero, although this value is mostly a baseline and may not have strong practical meaning in this context. The coefficient for gross_savings_percent_of_gdp (0.10046) indicates that each 1 percentage point increase in gross savings as a share of GDP is associated with GDP growth that is approximately 0.10 percentage points higher on average. Because this is a simple regression, no other factors are controlled for.
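
To make the slope concrete, the fitted equation can be evaluated at two savings rates and the uncertainty around the slope quantified. This sketch works from the rounded coefficients printed in the summary, so results are approximate. The savings rates of 20% and 21% are arbitrary illustrative inputs:

```r
# Coefficients and standard error copied from summary(lm_model)
b0       <- 0.58657
b1       <- 0.10046
se_b1    <- 0.03469
df_resid <- 55

# Fitted growth at two illustrative savings rates (20% and 21% of GDP)
predict_growth <- function(savings) b0 + b1 * savings
print(predict_growth(c(20, 21)))
print(diff(predict_growth(c(20, 21))))   # equals the slope

# Approximate 95% confidence interval for the slope
ci_b1 <- b1 + c(-1, 1) * qt(0.975, df_resid) * se_b1
print(round(ci_b1, 3))
```

With the fitted object available, `confint(lm_model)` would give the same interval from unrounded values.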

Evaluation

The p-value for gross savings (0.00541) is less than 0.05, indicating that the relationship between gross savings and GDP growth is statistically significant. This suggests that higher savings rates are associated with higher economic growth. The R² value of 0.132 means that gross savings explain about 13.2% of the variation in GDP growth. While this shows some explanatory power, it also suggests that many other factors such as investment, trade, labor markets, and policy conditions also influence economic growth.

5.4 Research Question 4:

Does the relationship between gross savings and GDP growth remain significant after accounting for inflation and export intensity?

New Variables

inflation_consumer_prices_annual_percent (continuous)

Inflation is included as an indicator of macroeconomic stability. High or volatile inflation can reduce purchasing power, create uncertainty, and discourage investment, potentially slowing economic growth.

gross_savings_percent_of_gdp (continuous)

Gross savings is included because it reflects the amount of resources available for investment in an economy. Higher savings can finance capital formation and infrastructure, which are key drivers of economic growth.

exports_of_goods_and_services_percent_of_gdp (continuous)

Exports are included to capture a country’s level of trade openness. Economies that are more integrated into global markets may experience higher growth due to increased demand, specialization, and efficiency gains.

income_group (categorical)

Income group is listed as a grouping variable; it is not a predictor in the linear model below.

Linear Model

lm_model2 <- lm(
  gdp_growth_annual_percent ~ 
    gross_savings_percent_of_gdp +
    inflation_consumer_prices_annual_percent +
    exports_of_goods_and_services_percent_of_gdp,
  data = wdi_clean
)
summary(lm_model2)

## 
## Call:
## lm(formula = gdp_growth_annual_percent ~ gross_savings_percent_of_gdp + 
##     inflation_consumer_prices_annual_percent + exports_of_goods_and_services_percent_of_gdp, 
##     data = wdi_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6469 -1.5431  0.1496  1.3035  6.0281 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   1.009729   0.911967   1.107
## gross_savings_percent_of_gdp                  0.098257   0.036285   2.708
## inflation_consumer_prices_annual_percent     -0.016359   0.010493  -1.559
## exports_of_goods_and_services_percent_of_gdp -0.006132   0.012069  -0.508
##                                              Pr(>|t|)   
## (Intercept)                                   0.27321   
## gross_savings_percent_of_gdp                  0.00909 **
## inflation_consumer_prices_annual_percent      0.12495   
## exports_of_goods_and_services_percent_of_gdp  0.61351   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.294 on 53 degrees of freedom
## Multiple R-squared:  0.1718, Adjusted R-squared:  0.1249 
## F-statistic: 3.664 on 3 and 53 DF,  p-value: 0.01788

Model Equation

Linear Regression Form

\[ \text{GDP Growth}_i = 1.0097 + 0.0983 \cdot (\text{Gross Savings}_i) - 0.0164 \cdot (\text{Inflation}_i) - 0.0061 \cdot (\text{Exports}_i) + \epsilon_i \]

Interpretation:

The regression results show that gross savings has a positive and statistically significant association with GDP growth (p = 0.009), so the savings relationship remains significant after accounting for inflation and export intensity. Inflation and exports have negative coefficients, but neither is statistically significant (p = 0.125 and p = 0.614), suggesting limited evidence of their impact in this model. The overall model is statistically significant (p = 0.0179), meaning at least one predictor contributes to explaining GDP growth. However, the R² of 0.1718 indicates that the model explains only about 17% of the variation, and the adjusted R² (0.1249) is barely above that of the simple model (0.1166), implying that other important factors are not included.
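
Whether inflation and exports jointly add explanatory power over the savings-only model can be gauged with a partial F-test. Both models are fitted on the same 57 rows of `wdi_clean`, so they are nested. The sketch below is an approximation built from the rounded R² values printed in the two summaries:

```r
# R-squared values and residual df copied from the two summaries
r2_simple <- 0.1323   # growth ~ savings
r2_full   <- 0.1718   # growth ~ savings + inflation + exports
df_full   <- 53       # residual degrees of freedom of the larger model
q         <- 2        # number of added predictors

# Partial F statistic for the two added predictors
f_stat <- ((r2_full - r2_simple) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(f_stat, q, df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```

With the fitted objects in memory, `anova(lm_model, lm_model2)` performs the same comparison on unrounded values. A p-value well above 0.05 here would indicate that the two added predictors do not significantly improve the fit.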

Visualization 3 : Actual vs predicted plot

wdi_clean$pred <- predict(lm_model2)

ggplot(wdi_clean, aes(x = pred, y = gdp_growth_annual_percent)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Actual vs Predicted GDP Growth",
       x = "Predicted",
       y = "Actual")

The plot shows the relationship between actual and predicted GDP growth values from the regression model. Points closer to the red 45-degree line indicate more accurate predictions, while larger deviations reflect prediction errors. Overall, the model captures general trends but shows noticeable dispersion, suggesting moderate predictive accuracy.

Multicollinearity Check

library(car)
vif(lm_model2)

##                 gross_savings_percent_of_gdp 
##                                     1.104725 
##     inflation_consumer_prices_annual_percent 
##                                     1.037578 
## exports_of_goods_and_services_percent_of_gdp 
##                                     1.113485

All VIF values are close to 1, indicating very low multicollinearity among the predictors. This suggests that the variables are not highly correlated and can be reliably included in the model.
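
A VIF is 1 / (1 − R²ⱼ), where R²ⱼ is obtained by regressing predictor j on the remaining predictors. The sketch below illustrates this on simulated predictors (`sim`, `x1`, `x2`, `x3` and the helper `manual_vif()` are hypothetical names, not part of the analysis) and then converts the reported VIFs back into the implied R²ⱼ values:

```r
set.seed(1)
n <- 57
# Simulated predictors (illustrative only); x3 is mildly related to x1
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.3 * sim$x1 + rnorm(n)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}
print(round(manual_vif(sim), 3))

# R_j^2 implied by the VIFs reported above
reported_vif <- c(savings = 1.104725, inflation = 1.037578, exports = 1.113485)
print(round(1 - 1 / reported_vif, 3))
```

Read this way, at most about 10% of the variance in any one predictor is explained by the other two.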

Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_model2)

Interpretation:

Residuals vs Fitted

The residuals are fairly randomly scattered around zero, suggesting that the linearity assumption is reasonably satisfied. There is no strong visible pattern, although slight clustering may indicate minor model misspecification.

Normal Q-Q

Most points lie close to the reference line, indicating that residuals are approximately normally distributed. Some deviation at the extremes suggests mild non-normality, but not severe.

Scale-Location

The spread of residuals appears relatively constant, although there is a slight downward trend. This suggests mild heteroscedasticity, but the issue does not appear severe.

Residuals vs Leverage

Most observations have low leverage, with a few moderate points. No point appears to cross the Cook’s distance contours in this panel, but the separate Cook’s distance plot later in this section flags observation ~21 (Cook’s D ≈ 0.52), so influential observations cannot be ruled out from this panel alone.

Residuals vs X Variables

# Add residuals to dataset
wdi_clean$residuals <- resid(lm_model2)

# Residuals vs Savings
ggplot(wdi_clean, aes(x = gross_savings_percent_of_gdp, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Gross Savings", x = "Gross Savings (% GDP)", y = "Residuals")

# Residuals vs Inflation
ggplot(wdi_clean, aes(x = inflation_consumer_prices_annual_percent, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Inflation", x = "Inflation (%)", y = "Residuals")

# Residuals vs Exports
ggplot(wdi_clean, aes(x = exports_of_goods_and_services_percent_of_gdp, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Exports", x = "Exports (% GDP)", y = "Residuals")

In the first plot (gross savings), residuals are randomly scattered around zero, consistent with the linearity assumption. The spread is fairly constant, suggesting no strong heteroscedasticity.
In the second plot (inflation), residuals show no clear pattern, but most observations are concentrated at low inflation with a few extreme values. This suggests a skewed predictor whose relationship with growth may not be well captured by a linear term.
In the third plot (exports), residuals are generally centered around zero with no strong curvature, supporting linearity. Some uneven spread and outliers suggest mild heteroscedasticity.
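
Visual judgments about heteroscedasticity can be backed by a formal check. The sketch below hand-computes the studentized (Koenker) Breusch-Pagan statistic on simulated data whose error spread deliberately grows with the predictor. This test is not part of the original analysis and the data are made up. For the actual model, `lmtest::bptest(lm_model2)` offers a packaged version, assuming the `lmtest` package is installed.

```r
set.seed(2024)
n <- 500
# Simulated data whose error spread grows with x (illustrative only)
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2, chi-squared with 1 df
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

With three predictors, as in the report's model, the auxiliary regression would include all three and the reference distribution would have 3 degrees of freedom.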

Correlation Heatmap

numeric_data <- wdi_clean |>
  select(
    gdp_growth_annual_percent,
    gross_savings_percent_of_gdp,
    inflation_consumer_prices_annual_percent,
    exports_of_goods_and_services_percent_of_gdp
  )
cor_matrix <- cor(numeric_data)

labels <- c("GDP growth", "Gross savings", "Inflation", "Exports % GDP")
colnames(cor_matrix) <- rownames(cor_matrix) <- labels

# Plot
corrplot(cor_matrix,
  method      = "color",
  addCoef.col = "black",
  tl.col      = "black",
  tl.srt      = 45,
  tl.cex      = 0.85,
  number.cex  = 0.8
)

The strongest relationship is between GDP growth and gross savings (r = 0.36), a moderate positive correlation suggesting that countries that save more tend to grow faster. Inflation has a weak negative correlation with both GDP growth (−0.24) and gross savings (−0.14), so higher inflation is associated with slightly lower values of both; correlation alone does not show that inflation causes the decline. Exports % GDP shows very weak correlations with the other three variables, implying it varies largely independently of them.
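To put these coefficients in perspective, squaring a Pearson correlation gives the share of variance the two variables have in common. The short sketch below uses only the three values reported above; the vector names are chosen here for illustration.

```r
# Correlations as reported in the heatmap interpretation above
r <- c(growth_savings    =  0.36,
       growth_inflation  = -0.24,
       savings_inflation = -0.14)

# Squared correlation = proportion of shared variance
shared_variance <- round(r^2, 3)
print(shared_variance)
```

Even the strongest pair, GDP growth and gross savings, shares only about 13% of its variance (0.36 squared is roughly 0.13), which is consistent with the modest explanatory power of the regression models.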

Residual Histogram

ggplot(wdi_clean, aes(x = residuals)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency")

The residuals are roughly bell-shaped and centered near zero, which is a good sign that the linear regression assumptions are reasonably met. However, the distribution is slightly irregular with some gaps and a few outliers beyond ±4, suggesting mild non-normality. Overall the model’s errors are acceptably distributed but not perfectly normal.
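A histogram depends on the choice of bins, so a normal Q-Q plot and a Shapiro-Wilk test are common complements when judging normality. The sketch below uses simulated stand-in data (an assumption, not World Bank values); in this report the residuals would presumably come from resid(lm_model2) instead.

```r
# Simulated stand-in data (assumption: NOT the World Bank dataset)
set.seed(123)
n <- 60
sim <- data.frame(
  savings   = runif(n, 10, 45),
  inflation = rexp(n, rate = 1 / 4)
)
sim$growth <- 1 + 0.08 * sim$savings - 0.05 * sim$inflation + rnorm(n, sd = 2)

fit <- lm(growth ~ savings + inflation, data = sim)
res <- resid(fit)

# Q-Q plot: points close to the dashed line indicate approximate normality
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, lty = 2)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(res)
print(sw)
```

With a few outliers beyond ±4, the tails of the Q-Q plot would be the place to look for the departures described above.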

Cook’s Distance Plot

cooks_d <- cooks.distance(lm_model2)

plot(cooks_d, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(cooks_d), col = "red")

Observation ~21 has a Cook’s D of ~0.52, far exceeding the red 4/n threshold line (~0.07), flagging it as a highly influential point that could be distorting the regression coefficients. Observation ~56 also crosses the threshold slightly and warrants attention. The vast majority of observations sit well below the cutoff, meaning one or two data points account for most of the influence concern.
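One way to gauge how much the flagged points matter is to refit the model without them and compare coefficients. The sketch below does this on simulated data with one deliberately planted influential point at row 21; the data frame sim and the planted values are assumptions for illustration, not the World Bank observations.

```r
# Simulated stand-in data (assumption: NOT the World Bank dataset)
set.seed(7)
n <- 57
sim <- data.frame(savings = runif(n, 10, 40))
sim$growth <- 1 + 0.1 * sim$savings + rnorm(n, sd = 1)

# Plant one high-leverage point with a large residual at row 21
sim$savings[21] <- 80
sim$growth[21]  <- -10

fit_all <- lm(growth ~ savings, data = sim)

# Flag observations above the 4/n rule of thumb
cooks_d <- cooks.distance(fit_all)
flagged <- unname(which(cooks_d > 4 / length(cooks_d)))

# Refit on the remaining rows (setdiff is safe even if nothing is flagged)
keep     <- setdiff(seq_len(nrow(sim)), flagged)
fit_trim <- lm(growth ~ savings, data = sim[keep, ])

# Side-by-side coefficients: large shifts indicate the fit depends on those points
comparison <- rbind(all_rows = coef(fit_all), without_flagged = coef(fit_trim))
print(flagged)
print(comparison)
```

If the coefficients barely move, the flagged points are harmless; if they shift noticeably, it is worth reporting both fits rather than silently dropping observations.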

Final Model Evaluation

The multiple regression model improves upon the simple model by incorporating additional macroeconomic variables, with gross savings emerging as the only statistically significant predictor. While inflation and exports were theoretically relevant, they did not show significant effects in this dataset, suggesting their impact may be context-dependent or captured indirectly. Diagnostic plots indicate that model assumptions are largely satisfied, with only minor concerns such as slight heteroscedasticity and non-normality. The interaction model further reveals that the relationship between savings and growth differs across income groups, significantly improving model fit. Overall, the analysis highlights that economic growth is influenced by multiple factors, but additional variables such as labor force or investment may be needed for a more comprehensive model.
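The claim that an interaction "significantly improves model fit" is typically checked with a nested-model F-test. The sketch below compares an additive model against one with a savings-by-income-group interaction using anova(). The data are simulated and the names sim, m_add and m_int are assumptions for illustration; they are not the models fitted in this report.

```r
# Simulated stand-in data (assumption: NOT the World Bank dataset)
set.seed(42)
n <- 120
sim <- data.frame(
  income_group = factor(rep(c("High income",
                              "Upper middle income",
                              "Lower middle income"), each = 40)),
  savings = runif(n, 10, 45)
)

# Give each income group its own savings slope (indexed by factor level code)
slope <- c(0.02, 0.15, 0.08)[as.integer(sim$income_group)]
sim$growth <- 1 + slope * sim$savings + rnorm(n, sd = 1.5)

# Additive model: one common savings slope, group-specific intercepts
m_add <- lm(growth ~ savings + income_group, data = sim)

# Interaction model: savings slope allowed to differ by income group
m_int <- lm(growth ~ savings * income_group, data = sim)

# Nested F-test: does the interaction explain significantly more variance?
cmp <- anova(m_add, m_int)
print(cmp)
```

The second row of the anova table reports the F statistic and p-value for the added interaction terms (two extra coefficients for three income groups).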

6. Conclusions

This analysis provides evidence that economic growth patterns differ significantly across countries based on income level, structural factors, and macroeconomic conditions.
High-income countries experience significantly lower GDP growth rates compared to middle- and low-income countries, supporting the economic theory of convergence, where developing economies grow faster through industrialization, capital accumulation, and structural transformation.
The association between export intensity and GDP growth classification is statistically significant, but the effect size is modest, suggesting that trade openness alone is not a dominant driver of whether a country achieves high growth.
Regression analysis highlights gross savings as a consistent and significant predictor of GDP growth: higher savings rates are associated with higher economic growth, indicating the importance of domestic resource mobilization for investment and long-term development.
Inflation and export share were not statistically significant in the multivariate model, suggesting their effects may be indirect, context-specific, or captured through other economic channels.
Overall, while structural factors like income level explain broad growth differences, internal economic capacity, particularly savings and investment, plays a more direct role in driving growth outcomes.

7. Recommendations

Based on the findings, the following recommendations are proposed for policymakers and international economic organizations:

Boost Savings and Investment

Since savings is the strongest predictor of growth, governments should strengthen financial systems, encourage household savings, and expand access to investment. More savings means more capital for infrastructure and long-term development.

Support Structural Growth in Poorer Countries

Lower- and middle-income countries grow faster but less steadily. Governments should focus on building stronger institutions, diversifying their economies, and improving governance to make growth more stable and sustainable.

Don’t Rely Only on Exports

Export intensity is linked to growth, but the effect is small. Trade promotion alone is not enough; countries need a balanced approach that also includes domestic investment, industrial policy, and education.

Keep Inflation Under Control

Inflation was not significant in the model, but low-income countries still show signs of price instability. Stable monetary policy reduces uncertainty and creates a better environment for investment and growth.

Collect Better Data

The models explain only a limited share of growth variation, meaning key factors like education, labor productivity, and institutional quality are missing. Better data collection and monitoring would lead to more accurate analysis and smarter policy decisions.

Slides

View Slide Deck

Statistics-Final Project-World Bank

Mohid
