Data Dive : Week 6

Load required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the World Bank Dataset

# Treat the '..' placeholder as NA so that numeric columns are parsed as doubles.
world_bank <- read_csv("C:/Users/SP KHALID/Downloads/WDI- World Bank Dataset.csv" , na = c('..'))

## Rows: 1675 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Time Code, Country Name, Country Code, Region, Income Group
## dbl (14): Time, GDP (constant 2015 US$), GDP growth (annual %), GDP (current...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
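To illustrate why the `na = c('..')` argument matters, here is a minimal sketch on a made-up two-row extract (the column names and values are hypothetical, not taken from the WDI file). It assumes readr 2.0 or later, where `I()` marks a string as literal data.

```r
library(readr)

# Hypothetical two-row extract; the values are invented for illustration.
toy_csv <- I("Country Code,Unemployment\nBRA,..\nCHN,3.7\n")

# Without `na`, the '..' placeholder forces the whole column to character.
raw <- read_csv(toy_csv, show_col_types = FALSE)

# With `na = ".."`, the placeholder becomes NA and the column parses as double.
clean <- read_csv(toy_csv, na = "..", show_col_types = FALSE)

class(raw$Unemployment)    # "character"
class(clean$Unemployment)  # "numeric"
is.na(clean$Unemployment)  # TRUE FALSE
```

This is consistent with the column specification above, where all 14 indicator columns were read as `dbl`.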

world_bank

## # A tibble: 1,675 × 19
##     Time `Time Code` `Country Name` `Country Code` Region         `Income Group`
##    <dbl> <chr>       <chr>          <chr>          <chr>          <chr>         
##  1  2000 YR2000      Brazil         BRA            Latin America… Upper middle …
##  2  2000 YR2000      China          CHN            East Asia & P… Upper middle …
##  3  2000 YR2000      France         FRA            Europe & Cent… High income   
##  4  2000 YR2000      Germany        DEU            Europe & Cent… High income   
##  5  2000 YR2000      India          IND            South Asia     Lower middle …
##  6  2000 YR2000      Indonesia      IDN            East Asia & P… Upper middle …
##  7  2000 YR2000      Italy          ITA            Europe & Cent… High income   
##  8  2000 YR2000      Japan          JPN            East Asia & P… High income   
##  9  2000 YR2000      Korea, Rep.    KOR            East Asia & P… High income   
## 10  2000 YR2000      Mexico         MEX            Latin America… Upper middle …
## # ℹ 1,665 more rows
## # ℹ 13 more variables: `GDP (constant 2015 US$)` <dbl>,
## #   `GDP growth (annual %)` <dbl>, `GDP (current US$)` <dbl>,
## #   `Unemployment, total (% of total labor force)` <dbl>,
## #   `Inflation, consumer prices (annual %)` <dbl>, `Labor force, total` <dbl>,
## #   `Population, total` <dbl>,
## #   `Exports of goods and services (% of GDP)` <dbl>, …

dim(world_bank)

## [1] 1675   19

# Check column data types
glimpse(world_bank)

## Rows: 1,675
## Columns: 19
## $ Time                                                          <dbl> 2000, 20…
## $ `Time Code`                                                   <chr> "YR2000"…
## $ `Country Name`                                                <chr> "Brazil"…
## $ `Country Code`                                                <chr> "BRA", "…
## $ Region                                                        <chr> "Latin A…
## $ `Income Group`                                                <chr> "Upper m…
## $ `GDP (constant 2015 US$)`                                     <dbl> 1.18642e…
## $ `GDP growth (annual %)`                                       <dbl> 4.387949…
## $ `GDP (current US$)`                                           <dbl> 6.554482…
## $ `Unemployment, total (% of total labor force)`                <dbl> NA, 3.70…
## $ `Inflation, consumer prices (annual %)`                       <dbl> 7.044141…
## $ `Labor force, total`                                          <dbl> 80295093…
## $ `Population, total`                                           <dbl> 17401828…
## $ `Exports of goods and services (% of GDP)`                    <dbl> 10.18805…
## $ `Imports of goods and services (% of GDP)`                    <dbl> 12.45171…
## $ `General government final consumption expenditure (% of GDP)` <dbl> 18.76784…
## $ `Foreign direct investment, net inflows (% of GDP)`           <dbl> 5.033917…
## $ `Gross savings (% of GDP)`                                    <dbl> 13.99170…
## $ `Current account balance (% of GDP)`                          <dbl> -4.04774…

# Convert Time column to integer
world_bank$Time <- as.integer(world_bank$Time)

# Clean column names
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

df <- world_bank |> clean_names()
glimpse(df)

## Rows: 1,675
## Columns: 19
## $ time                                                            <int> 2000, …
## $ time_code                                                       <chr> "YR200…
## $ country_name                                                    <chr> "Brazi…
## $ country_code                                                    <chr> "BRA",…
## $ region                                                          <chr> "Latin…
## $ income_group                                                    <chr> "Upper…
## $ gdp_constant_2015_us                                            <dbl> 1.1864…
## $ gdp_growth_annual_percent                                       <dbl> 4.3879…
## $ gdp_current_us                                                  <dbl> 6.5544…
## $ unemployment_total_percent_of_total_labor_force                 <dbl> NA, 3.…
## $ inflation_consumer_prices_annual_percent                        <dbl> 7.0441…
## $ labor_force_total                                               <dbl> 802950…
## $ population_total                                                <dbl> 174018…
## $ exports_of_goods_and_services_percent_of_gdp                    <dbl> 10.188…
## $ imports_of_goods_and_services_percent_of_gdp                    <dbl> 12.451…
## $ general_government_final_consumption_expenditure_percent_of_gdp <dbl> 18.767…
## $ foreign_direct_investment_net_inflows_percent_of_gdp            <dbl> 5.0339…
## $ gross_savings_percent_of_gdp                                    <dbl> 13.991…
## $ current_account_balance_percent_of_gdp                          <dbl> -4.047…
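The renaming done by `clean_names()` can be previewed on a few of the original column names. The sketch below uses janitor's `make_clean_names()`, the vector-level helper behind `clean_names()`; the three names are copied from the `glimpse()` output above.

```r
library(janitor)

original <- c("GDP growth (annual %)", "GDP (constant 2015 US$)", "Income Group")

# Lower-case, replace spaces and punctuation with underscores, spell out "%".
cleaned <- make_clean_names(original)
cleaned
# "gdp_growth_annual_percent" "gdp_constant_2015_us" "income_group"
```

Names without spaces or symbols can then be used in `filter()` and `mutate()` without backticks.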

Numerical Variables

Pair 1

Q1: Is there a relationship between economic growth and changes in unemployment in the United States?

gdp_growth_annual_percent (explanatory variable)

unemployment_change (response variable)

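# Keep the U.S. rows in chronological order, then compute the year-over-year
# change in the unemployment rate (in percentage points).
df_one <- df |>
  filter(country_code == "USA") |>
  arrange(time) |>
  mutate(
    unemployment_change = unemployment_total_percent_of_total_labor_force -
      lag(unemployment_total_percent_of_total_labor_force)
  )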
df_one

## # A tibble: 25 × 20
##     time time_code country_name  country_code region        income_group
##    <int> <chr>     <chr>         <chr>        <chr>         <chr>       
##  1  2000 YR2000    United States USA          North America High income 
##  2  2001 YR2001    United States USA          North America High income 
##  3  2002 YR2002    United States USA          North America High income 
##  4  2003 YR2003    United States USA          North America High income 
##  5  2004 YR2004    United States USA          North America High income 
##  6  2005 YR2005    United States USA          North America High income 
##  7  2006 YR2006    United States USA          North America High income 
##  8  2007 YR2007    United States USA          North America High income 
##  9  2008 YR2008    United States USA          North America High income 
## 10  2009 YR2009    United States USA          North America High income 
## # ℹ 15 more rows
## # ℹ 14 more variables: gdp_constant_2015_us <dbl>,
## #   gdp_growth_annual_percent <dbl>, gdp_current_us <dbl>,
## #   unemployment_total_percent_of_total_labor_force <dbl>,
## #   inflation_consumer_prices_annual_percent <dbl>, labor_force_total <dbl>,
## #   population_total <dbl>, exports_of_goods_and_services_percent_of_gdp <dbl>,
## #   imports_of_goods_and_services_percent_of_gdp <dbl>, …

ggplot(df_one, aes(x = gdp_growth_annual_percent,
                   y = unemployment_change)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = 'lm' , se = FALSE) +
  labs(
    title = "GDP Growth vs Unemployment Change"
  )

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
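The single removed row is the first year of the series: `lag()` has no previous value there, so `unemployment_change` is `NA` for 2000. A minimal sketch with made-up rates (the numbers are hypothetical) shows the mechanism. It calls `dplyr::lag()` explicitly because, as the startup messages note, dplyr masks `stats::lag()`.

```r
library(dplyr)

# Hypothetical unemployment rates (percent) for three consecutive years.
rate <- c(4.0, 4.7, 5.8)

# dplyr::lag() shifts the series by one position, so the first element has
# no predecessor and becomes NA.
change <- rate - dplyr::lag(rate)
change  # NA for the first year, then the year-over-year differences

sum(is.na(change))  # 1
```

This also explains why the later tests use 24 observations although `df_one` has 25 rows.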

Consistent with Okun’s law, the scatterplot shows a negative correlation between GDP growth and the change in unemployment: in years with higher GDP growth, the unemployment rate tends to fall (a negative change), which indicates a strong inverse relationship. Outliers: there appears to be one extreme negative GDP growth observation, likely a recession year, and one high positive GDP growth observation. Both points are in line with the negative trend, but because they lie far from the rest of the data they may have a disproportionate influence on the correlation coefficient.
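One possible way to gauge how much a single extreme year drives the correlation is to compare Pearson's r with and without that point, and against the rank-based Spearman coefficient. The sketch below uses invented numbers (five ordinary years plus one extreme recession year), not the WDI values, and is only meant to show the idea.

```r
# Hypothetical data: five ordinary years plus one extreme recession year (last).
growth <- c(1.5, 2.0, 2.5, 3.0, 3.5, -8.0)
change <- c(0.3, -0.2, 0.1, -0.4, 0.0, 5.0)

# Pearson's r on all six points, and again without the extreme year.
pearson_all     <- cor(growth, change)
pearson_trimmed <- cor(growth[-6], change[-6])

# Spearman's rho uses ranks, so one extreme value carries less weight.
spearman_all <- cor(growth, change, method = "spearman")

c(pearson_all = pearson_all,
  pearson_trimmed = pearson_trimmed,
  spearman_all = spearman_all)
```

In this toy example the extreme year pulls Pearson's r much closer to -1 than the other five points alone would suggest. The same comparison could be run on `df_one` to see whether the result there depends on the recession year.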

Correlation Test (Pearson Method)

cor.test(df_one$gdp_growth_annual_percent,
         df_one$unemployment_change,
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  df_one$gdp_growth_annual_percent and df_one$unemployment_change
## t = -8.8083, df = 22, p-value = 1.152e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9483755 -0.7442675
## sample estimates:
##       cor 
## -0.882659

r = -0.8827

95%CI: (-0.9484, -0.7443)

p-value = 1.152e-08 (extremely small, i.e. statistically significant)

The Pearson correlation coefficient between GDP growth and unemployment change is -0.883, based on 24 yearly observations (df = 22), indicating a strong negative linear relationship. This suggests that years with higher economic growth are strongly associated with decreases in unemployment. The 95% confidence interval (-0.948, -0.744) lies entirely below zero, so the data are consistent with a true correlation that is at least moderately strong and negative.
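As a cross-check, the t statistic and confidence interval reported by `cor.test()` can be recomputed from r and n alone. This sketch uses the standard formulas: t = r * sqrt(n - 2) / sqrt(1 - r^2), and a confidence interval based on Fisher's z-transformation. The inputs are the values printed above.

```r
r <- -0.882659  # sample correlation reported by cor.test()
n <- 24         # number of complete pairs, since cor.test() reports df = n - 2 = 22

# t statistic for H0: true correlation is 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% CI via Fisher's z-transformation, back-transformed with tanh().
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 4)  # -8.8083
round(ci, 4)      # -0.9484 -0.7443
```

Both results agree with the `cor.test()` output, up to the rounding of r.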

Why this makes sense

When economic output grows faster, firms expand production and demand more labor, so unemployment tends to fall.

Confidence Interval

t.test(df_one$unemployment_change)

## 
##  One Sample t-test
## 
## data:  df_one$unemployment_change
## t = 0.0041487, df = 23, p-value = 0.9967
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.6220297  0.6245297
## sample estimates:
## mean of x 
##   0.00125

mean unemployment_change = 0.00125

95% CI: (-0.6220, 0.6245)

p-value = 0.9967

The 95% confidence interval for the mean annual change in unemployment is (-0.622, 0.625) percentage points. Because this interval contains zero, we fail to reject the null hypothesis that the average annual change in unemployment is zero. This suggests that, over the observed period, unemployment in the United States fluctuated around a stable long-term level rather than exhibiting a consistent upward or downward trend.
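The interval reported by `t.test()` is the sample mean plus or minus a t critical value times the standard error. A minimal sketch on made-up changes (the vector `x` is hypothetical, not the WDI series) shows that the manual calculation matches the built-in one.

```r
# Hypothetical year-over-year changes in percentage points.
x <- c(0.7, 1.1, 0.2, -0.5, -0.4, -0.5, 0.0, 1.2, 3.5, 0.4)

n      <- length(x)
se     <- sd(x) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value

ci_manual  <- mean(x) + c(-1, 1) * t_crit * se
ci_builtin <- as.numeric(t.test(x)$conf.int)

all.equal(ci_manual, ci_builtin)  # TRUE
```

In the analysis above, `t.test()` drops the one `NA` in `unemployment_change`, so n = 24 and df = 23.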

Pair 2

Q2: How does the trade balance relate to economic growth in the United States?

net_trade_percentage_gdp (explanatory variable)

gdp_growth_annual_percent (response variable)

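# Keep the U.S. rows in chronological order, then compute net trade as
# exports minus imports (both as % of GDP).
df_two <- df |>
  filter(country_code == "USA") |>
  arrange(time) |>
  mutate(
    net_trade_percentage_gdp = exports_of_goods_and_services_percent_of_gdp -
      imports_of_goods_and_services_percent_of_gdp
  )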
df_two

## # A tibble: 25 × 20
##     time time_code country_name  country_code region        income_group
##    <int> <chr>     <chr>         <chr>        <chr>         <chr>       
##  1  2000 YR2000    United States USA          North America High income 
##  2  2001 YR2001    United States USA          North America High income 
##  3  2002 YR2002    United States USA          North America High income 
##  4  2003 YR2003    United States USA          North America High income 
##  5  2004 YR2004    United States USA          North America High income 
##  6  2005 YR2005    United States USA          North America High income 
##  7  2006 YR2006    United States USA          North America High income 
##  8  2007 YR2007    United States USA          North America High income 
##  9  2008 YR2008    United States USA          North America High income 
## 10  2009 YR2009    United States USA          North America High income 
## # ℹ 15 more rows
## # ℹ 14 more variables: gdp_constant_2015_us <dbl>,
## #   gdp_growth_annual_percent <dbl>, gdp_current_us <dbl>,
## #   unemployment_total_percent_of_total_labor_force <dbl>,
## #   inflation_consumer_prices_annual_percent <dbl>, labor_force_total <dbl>,
## #   population_total <dbl>, exports_of_goods_and_services_percent_of_gdp <dbl>,
## #   imports_of_goods_and_services_percent_of_gdp <dbl>, …

ggplot(df_two,
       aes(x = net_trade_percentage_gdp,
           y = gdp_growth_annual_percent)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Net Trade (% GDP) vs GDP Growth")

## `geom_smooth()` using formula = 'y ~ x'

The scatterplot of GDP growth against net trade (% of GDP) shows a slight downward trend. The points are widely scattered, indicating a weak relationship. Outliers: one strongly negative GDP growth year (likely a recession) stands apart from the rest, while the remaining observations cluster around moderate trade deficits.

Correlation Test (Pearson Method)

cor.test(df_two$net_trade_percentage_gdp,
         df_two$gdp_growth_annual_percent,
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  df_two$net_trade_percentage_gdp and df_two$gdp_growth_annual_percent
## t = -0.96441, df = 23, p-value = 0.3449
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5494740  0.2147101
## sample estimates:
##        cor 
## -0.1971464

r = -0.197

95% CI: (-0.549, 0.215)

p-value = 0.3449

The Pearson correlation between net trade (% of GDP) and GDP growth is -0.197, indicating a weak negative association. However, the relationship is not statistically significant (p = 0.345 > 0.05), and the 95% confidence interval (-0.549, 0.215) includes zero. Therefore, we do not have sufficient evidence to conclude that a linear relationship exists between trade balance and GDP growth over this period.

Why this makes sense

The U.S. runs a persistent trade deficit, and its GDP growth is driven largely by domestic consumption and investment, so fluctuations in the trade balance may not strongly determine short-run growth.

Confidence Interval

t.test(df_two$net_trade_percentage_gdp)

## 
##  One Sample t-test
## 
## data:  df_two$net_trade_percentage_gdp
## t = -19.12, df = 24, p-value = 4.957e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -4.087839 -3.291320
## sample estimates:
## mean of x 
##  -3.68958

Mean = -3.6896

95% CI = (-4.0878, -3.2913)

p-value = 4.957e-16

The mean net trade balance over the observed period is approximately -3.69% of GDP. The 95% confidence interval (-4.09%, -3.29%) does not contain zero, so we conclude that the average trade balance over this period is significantly negative, i.e. the United States ran a trade deficit on average.

Mohid

2026-02-23