Data Dive

Week 8

Load required packages

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Load the World Bank Dataset

# Treat the World Bank '..' placeholder as NA so that numeric columns parse as doubles.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv" , na = c('..')) 
## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank
## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …
dim(world_bank)
## [1] 1675   19
# Check column data types
glimpse(world_bank)
## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…
# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)
# Clean column names
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
df <- world_bank |> clean_names()
colnames(df)
##  [1] "time"                                                           
##  [2] "time_code"                                                      
##  [3] "country_name"                                                   
##  [4] "country_code"                                                   
##  [5] "region"                                                         
##  [6] "income_group"                                                   
##  [7] "gdp_constant_2015_us"                                           
##  [8] "gdp_growth_annual_percent"                                      
##  [9] "gdp_current_us"                                                 
## [10] "unemployment_total_percent_of_total_labor_force"                
## [11] "inflation_consumer_prices_annual_percent"                       
## [12] "labor_force_total"                                              
## [13] "population_total"                                               
## [14] "exports_of_goods_and_services_percent_of_gdp"                   
## [15] "imports_of_goods_and_services_percent_of_gdp"                   
## [16] "general_government_final_consumption_expenditure_percent_of_gdp"
## [17] "foreign_direct_investment_net_inflows_percent_of_gdp"           
## [18] "gross_savings_percent_of_gdp"                                   
## [19] "current_account_balance_percent_of_gdp"

New Variables

  • inflation_consumer_prices_annual_percent (continuous)

    Inflation is included as an indicator of macroeconomic stability. High or volatile inflation can reduce purchasing power, create uncertainty, and discourage investment, potentially slowing economic growth.

  • gross_savings_percent_of_gdp (continuous)

    Gross savings is included because it reflects the amount of resources available for investment in an economy. Higher savings can finance capital formation and infrastructure, which are key drivers of economic growth.

  • exports_of_goods_and_services_percent_of_gdp (continuous)

    Exports are included to capture a country’s level of trade openness. Economies that are more integrated into global markets may experience higher growth due to increased demand, specialization, and efficiency gains.

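  • income_group (categorical)

    Income group is included so that the relationship between savings and growth can be allowed to differ across World Bank income categories (High, Upper middle, Lower middle, and Low income).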

wdi_clean <- df |>
  filter(time == 2024L) |>  # time is an integer column, so compare against an integer
  select(
    country_name,
    income_group,
    gdp_growth_annual_percent,
    exports_of_goods_and_services_percent_of_gdp,
    inflation_consumer_prices_annual_percent,
    population_total,
    gross_savings_percent_of_gdp
  ) |>
  drop_na()

Linear Model

lm_model2 <- lm(
  gdp_growth_annual_percent ~ 
    gross_savings_percent_of_gdp +
    inflation_consumer_prices_annual_percent +
    exports_of_goods_and_services_percent_of_gdp,
  data = wdi_clean
)
summary(lm_model2)
## 
## Call:
## lm(formula = gdp_growth_annual_percent ~ gross_savings_percent_of_gdp + 
##     inflation_consumer_prices_annual_percent + exports_of_goods_and_services_percent_of_gdp, 
##     data = wdi_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6469 -1.5431  0.1496  1.3035  6.0281 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   1.009729   0.911967   1.107
## gross_savings_percent_of_gdp                  0.098257   0.036285   2.708
## inflation_consumer_prices_annual_percent     -0.016359   0.010493  -1.559
## exports_of_goods_and_services_percent_of_gdp -0.006132   0.012069  -0.508
##                                              Pr(>|t|)   
## (Intercept)                                   0.27321   
## gross_savings_percent_of_gdp                  0.00909 **
## inflation_consumer_prices_annual_percent      0.12495   
## exports_of_goods_and_services_percent_of_gdp  0.61351   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.294 on 53 degrees of freedom
## Multiple R-squared:  0.1718, Adjusted R-squared:  0.1249 
## F-statistic: 3.664 on 3 and 53 DF,  p-value: 0.01788

Interpretation:

The regression results show that gross savings is positively and significantly associated with GDP growth: each additional percentage point of savings (as a share of GDP) is associated with about 0.10 percentage points higher annual growth, holding inflation and exports constant (p = 0.009). Inflation and exports have negative coefficients, but neither is statistically significant (p = 0.125 and p = 0.614), so this model provides little evidence of an effect for either. The overall F-test is significant (p = 0.0179), meaning the predictors jointly explain more variation than an intercept-only model. However, the R² of 0.1718 (adjusted 0.1249) means the model explains only about 17% of the variation in growth across the 57 observations, which implies that other important factors are not included.
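As a rough check on the savings coefficient, a 95% confidence interval can be reconstructed from the estimate, standard error, and residual degrees of freedom printed in the summary output. The numbers below are copied from that output; with the fitted object available, `confint(lm_model2)` should give the same interval.

```r
# Values copied from the summary(lm_model2) output above
estimate  <- 0.098257   # coefficient on gross_savings_percent_of_gdp
std_error <- 0.036285
df_resid  <- 53         # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_resid)
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

round(c(lower = ci_lower, upper = ci_upper), 4)
```

The interval excludes zero, which is consistent with the reported p-value of 0.009.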

Multicollinearity Check

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(lm_model2)
##                 gross_savings_percent_of_gdp 
##                                     1.104725 
##     inflation_consumer_prices_annual_percent 
##                                     1.037578 
## exports_of_goods_and_services_percent_of_gdp 
##                                     1.113485

All VIF values are close to 1 (the largest is 1.11), well below the common rule-of-thumb thresholds of 5 or 10. The predictors are therefore only weakly correlated with one another, and multicollinearity is not meaningfully inflating the standard errors in this model.
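To make the VIF values less of a black box, here is a minimal sketch of the underlying calculation: each predictor is regressed on the others, and VIF = 1 / (1 - R²). The helper `vif_manual()` is hypothetical (it is not part of `car`), it only handles numeric predictors, and the built-in `mtcars` data stands in for `wdi_clean`, which cannot be reproduced here.

```r
# Hypothetical helper: VIF for each numeric predictor of a fitted lm
vif_manual <- function(model) {
  # Design matrix without the intercept column
  X <- model.matrix(model)[, -1, drop = FALSE]
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    # R-squared from regressing this predictor on all the others
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in model on built-in data (an assumption, not the report's data)
fit  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
vifs <- vif_manual(fit)
round(vifs, 3)
```

A VIF of 1 means the predictor is uncorrelated with the others; values near 1, as in the report, mean the standard errors are barely inflated.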

Interaction Term

This interaction is included to examine whether the effect of savings on growth differs across income groups. Economic theory suggests that the impact of savings may be stronger in developing economies compared to developed ones.

lm_model3 <- lm(
  gdp_growth_annual_percent ~ 
    gross_savings_percent_of_gdp * income_group,
  data = wdi_clean
)
summary(lm_model3)
## 
## Call:
## lm(formula = gdp_growth_annual_percent ~ gross_savings_percent_of_gdp * 
##     income_group, data = wdi_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5530 -0.9471 -0.0180  1.0495  3.9433 
## 
## Coefficients:
##                                                               Estimate
## (Intercept)                                                   1.583898
## gross_savings_percent_of_gdp                                  0.009002
## income_groupLow income                                        0.648024
## income_groupLower middle income                              -1.821792
## income_groupUpper middle income                              -2.150628
## gross_savings_percent_of_gdp:income_groupLow income           0.224818
## gross_savings_percent_of_gdp:income_groupLower middle income  0.162431
## gross_savings_percent_of_gdp:income_groupUpper middle income  0.144685
##                                                              Std. Error t value
## (Intercept)                                                    1.478651   1.071
## gross_savings_percent_of_gdp                                   0.057567   0.156
## income_groupLow income                                         2.408601   0.269
## income_groupLower middle income                                1.885745  -0.966
## income_groupUpper middle income                                1.908860  -1.127
## gross_savings_percent_of_gdp:income_groupLow income            0.114905   1.957
## gross_savings_percent_of_gdp:income_groupLower middle income   0.072914   2.228
## gross_savings_percent_of_gdp:income_groupUpper middle income   0.076170   1.899
##                                                              Pr(>|t|)  
## (Intercept)                                                    0.2893  
## gross_savings_percent_of_gdp                                   0.8764  
## income_groupLow income                                         0.7890  
## income_groupLower middle income                                0.3387  
## income_groupUpper middle income                                0.2654  
## gross_savings_percent_of_gdp:income_groupLow income            0.0561 .
## gross_savings_percent_of_gdp:income_groupLower middle income   0.0305 *
## gross_savings_percent_of_gdp:income_groupUpper middle income   0.0634 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.789 on 49 degrees of freedom
## Multiple R-squared:  0.5343, Adjusted R-squared:  0.4678 
## F-statistic: 8.031 on 7 and 49 DF,  p-value: 1.783e-06

Interpretation:

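High income is the reference category, so the main-effect coefficient on gross savings (0.009, p = 0.876) is the savings slope for high-income countries, which is essentially zero. The interaction terms are all positive, indicating steeper slopes in the other groups, but only the lower middle-income term is significant at the 5% level (p = 0.031); the low-income and upper middle-income terms are significant only at the 10% level (p = 0.056 and p = 0.063). This suggests that savings may have a stronger association with growth outside the high-income group, particularly in lower middle-income countries. R² rises to 0.53 (adjusted 0.47), compared with 0.17 for the additive model, although the two models use different predictors and are not nested, so this comparison is descriptive rather than a formal test.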
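The group-specific savings slopes implied by the interaction model can be worked out directly from the printed coefficients: the slope for a group is the main effect plus that group's interaction term. The values below are copied from the summary output; this is arithmetic on point estimates only.

```r
# Coefficients copied from the summary(lm_model3) output above.
# High income is the reference level, so its slope is the main effect alone.
base_slope <- 0.009002
interaction_terms <- c(
  "Low income"          = 0.224818,
  "Lower middle income" = 0.162431,
  "Upper middle income" = 0.144685
)

# Slope for each group = main effect + that group's interaction coefficient
slopes <- c("High income" = base_slope, base_slope + interaction_terms)
round(slopes, 3)
```

Standard errors for the summed slopes are not the sum of the printed standard errors; they would need the coefficient covariance matrix from `vcov(lm_model3)`.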

Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_model2)

Interpretation:

1. Residuals vs Fitted

The residuals are fairly randomly scattered around zero, suggesting that the linearity assumption is reasonably satisfied. There is no strong visible pattern, although slight clustering may indicate minor model misspecification.

2. Normal Q-Q

Most points lie close to the reference line, indicating that residuals are approximately normally distributed. Some deviation at the extremes suggests mild non-normality, but not severe.

3. Scale-Location

The spread of residuals appears relatively constant, although there is a slight downward trend. This suggests mild heteroscedasticity, but the issue does not appear severe.

4. Residuals vs Leverage

Most observations have low leverage, with a few moderate points but none exceeding Cook’s distance threshold. This indicates that there are no highly influential outliers significantly affecting the model.
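The visual checks can be complemented with a few numeric ones from base R. This is a sketch on the built-in `mtcars` data, used as a stand-in because `wdi_clean` cannot be reproduced here; the 4/n cutoff for Cook's distance and the "twice the average leverage" cutoff are common rules of thumb, not the contours that `plot()` draws.

```r
# Stand-in model on built-in data (an assumption, not the report's data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
sw <- shapiro.test(residuals(fit))

# Influence: Cook's distance per observation, flagged against the 4/n rule of thumb
cooks   <- cooks.distance(fit)
flagged <- names(cooks)[cooks > 4 / nrow(mtcars)]

# Leverage: hat values, flagged against twice the average leverage
lev           <- hatvalues(fit)
high_leverage <- names(lev)[lev > 2 * mean(lev)]

sw$p.value
flagged
high_leverage
```

Applied to the report's model, the same calls would take `lm_model2` in place of `fit` and `nrow(wdi_clean)` in place of `nrow(mtcars)`.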

Final Model Evaluation

The multiple regression model improves upon the simple model by incorporating additional macroeconomic variables, with gross savings emerging as the only statistically significant predictor. While inflation and exports were theoretically relevant, they did not show significant effects in this dataset, suggesting their impact may be context-dependent or captured indirectly. Diagnostic plots for the multiple regression model indicate that its assumptions are largely satisfied, with only minor concerns such as slight heteroscedasticity and non-normality. The interaction model further suggests that the relationship between savings and growth differs across income groups, and it fits the data considerably better (R² of 0.53 versus 0.17). Overall, the analysis highlights that economic growth is influenced by multiple factors, but additional variables such as labor force or investment may be needed for a more comprehensive model.
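A formal test of whether slopes differ across groups would compare the interaction model with an additive model in the same variables, using a nested F-test. The sketch below shows the pattern on the built-in `mtcars` data (a stand-in, with `cyl` playing the role of income group). For this report, the additive counterpart would be a model with `gross_savings_percent_of_gdp + income_group`, which is an assumption here because that model was not fitted above.

```r
# Stand-in data: treat the number of cylinders as a categorical group
cars <- transform(mtcars, cyl = factor(cyl))

# Additive model: common slope for wt, separate intercepts per group
additive <- lm(mpg ~ wt + cyl, data = cars)

# Interaction model: slope for wt allowed to differ by group
with_interaction <- lm(mpg ~ wt * cyl, data = cars)

# Nested F-test: do the interaction terms jointly improve the fit?
cmp <- anova(additive, with_interaction)
cmp
```

A small p-value in the second row would indicate that allowing group-specific slopes improves the fit beyond what separate intercepts already provide.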