World_Bank_Indicators

library(tidyverse)
library(stringr)
library(scales)

# Load dataset
Global_Development_Indicators <- read_csv("R-DATA/project_1_Global_Development_Indicators.csv")

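# Clean data: turn the ".." missing-value marker into NA, then convert year columns to numeric.
# Replacing ".." before as.numeric() avoids "NAs introduced by coercion" warnings.
base_clean <- Global_Development_Indicators %>%
  mutate(across(starts_with("19") | starts_with("20"), ~ as.numeric(na_if(as.character(.x), ".."))))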

# Extract and reshape indicators
gdp_data <- base_clean %>%
  filter(`Series Name` == "GDP (current US$)") %>%
  select(`Country Name`, `Country Code`, starts_with("20")) %>%
  pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "GDP_total") %>%
  mutate(Year = as.numeric(str_extract(Year, "\\d{4}"))) %>%
  filter(!is.na(GDP_total))

pop_data <- base_clean %>%
  filter(`Series Name` == "Population, total") %>%
  select(`Country Name`, `Country Code`, starts_with("20")) %>%
  pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Population") %>%
  mutate(Year = as.numeric(str_extract(Year, "\\d{4}"))) %>%
  filter(!is.na(Population))

internet_data <- base_clean %>%
  filter(`Series Name` == "Individuals using the Internet (% of population)") %>%
  select(`Country Name`, `Country Code`, starts_with("20")) %>%
  pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Internet_users_pct") %>%
  mutate(Year = as.numeric(str_extract(Year, "\\d{4}"))) %>%
  filter(!is.na(Internet_users_pct))

education_data <- base_clean %>%
  filter(`Series Name` == "Adolescents out of school (% of lower secondary school age)") %>%
  select(`Country Name`, `Country Code`, starts_with("20")) %>%
  pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Out_of_school_pct") %>%
  mutate(Year = as.numeric(str_extract(Year, "\\d{4}"))) %>%
  filter(!is.na(Out_of_school_pct))

# Combine datasets into a final analysis dataset
analysis_data <- gdp_data %>%
  left_join(pop_data, by = c("Country Name", "Country Code", "Year")) %>%
  mutate(GDP_per_capita = GDP_total / Population) %>%
  left_join(internet_data, by = c("Country Name", "Country Code", "Year")) %>%
  left_join(education_data, by = c("Country Name", "Country Code", "Year")) %>%
  filter(Year >= 2000) %>%
  select(`Country Name`, `Country Code`, Year, GDP_per_capita, Internet_users_pct, Out_of_school_pct) %>%
  filter(!str_detect(`Country Name`, "World|income|Arab|East Asia|Europe|Latin|Middle|North America|South Asia|Sub-Saharan"))

# Countries with at least one observation of both GDP per capita and internet usage
complete_countries <- analysis_data %>%
  filter(!is.na(GDP_per_capita), !is.na(Internet_users_pct)) %>%
  distinct(`Country Name`)

glimpse(analysis_data)
Rows: 1,124
Columns: 6
$ `Country Name`     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ `Country Code`     <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "A…
$ Year               <dbl> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 200…
$ GDP_per_capita     <dbl> 180.1884, 142.9034, 182.1740, 199.6432, 221.8305, 2…
$ Internet_users_pct <dbl> NA, 0.004722568, 0.004561395, 0.087891253, 0.105809…
$ Out_of_school_pct  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

1 Introduction

This project explores global development indicators from 2000 onwards.
We are particularly interested in understanding the relationship between economic development, digital access, and education across countries.

The main objectives of this analysis are:

  • Examine trends in GDP per capita and internet usage over time.
  • Understand how economic prosperity correlates with digital connectivity.
  • Identify countries that deviate from expected patterns, highlighting potential development gaps.
  • Provide insights into the evolution of key development indicators over the last two decades.

1.1 Data Cleaning and Preparation

The dataset was obtained from the World Bank Global Development Indicators service.

Steps undertaken for data preparation:

  • Converted the ".." missing-value marker to NA and converted the year columns to numeric format.

  • Filtered for the indicators of interest: GDP, population, internet usage, and adolescents out of school.

  • Calculated GDP per capita using GDP and population data.

  • Filtered to include data from the year 2000 onward.

  • Excluded aggregate regions (e.g., World, East Asia & Pacific) to focus on individual countries.

  • Identified the countries that have at least one year with both GDP per capita and internet usage data (complete_countries).
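
The cleaning and reshaping steps above can be sketched on a toy table. This is a minimal illustration, not the project's actual pipeline: the two-row `raw_toy` table and its values are made up, and the column names only assume the usual World Bank export pattern such as `2000 [YR2000]`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw export (made-up values, hypothetical countries "A" and "B").
raw_toy <- tibble::tibble(
  `Country Name` = c("A", "B"),
  `2000 [YR2000]` = c("100", ".."),
  `2001 [YR2001]` = c("..", "250.5")
)

clean_toy <- raw_toy %>%
  # Replace ".." with NA first, so as.numeric() never sees a non-numeric string.
  mutate(across(matches("^(19|20)"), ~ as.numeric(na_if(as.character(.x), "..")))) %>%
  # One row per country-year.
  pivot_longer(matches("^(19|20)"), names_to = "Year", values_to = "value") %>%
  # Keep only the four-digit year from names like "2000 [YR2000]".
  mutate(Year = as.integer(substr(Year, 1, 4)))

print(clean_toy)
```

Doing the replacement before the conversion keeps the step free of coercion warnings, so a warning that does appear points to an unexpected value rather than to the known ".." marker.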

Outcome:

  • The final dataset contains 1,124 country-year observations covering 2000 to 2023; 47 countries have both GDP per capita and internet usage data.

  • GDP per capita has no missing values, while internet usage is missing for 132 observations (about 12%).

2 Exploratory Data Analysis

2.1.1 Summary Statistics

[1] "Summary Statistics:"
 GDP_per_capita     Internet_users_pct 
 Min.   :   110.5   Min.   : 0.004561  
 1st Qu.:  1371.4   1st Qu.: 7.676785  
 Median :  6436.7   Median :34.920986  
 Mean   : 19182.5   Mean   :40.249840  
 3rd Qu.: 34462.1   3rd Qu.:72.330369  
 Max.   :240862.2   Max.   :99.687020  
                    NA's   :132        
[1] "Detailed Statistics:"
# A tibble: 1 × 14
  GDP_mean GDP_median GDP_sd GDP_min GDP_max Internet_mean Internet_median
     <dbl>      <dbl>  <dbl>   <dbl>   <dbl>         <dbl>           <dbl>
1   19183.      6437. 29197.    110. 240862.          40.2            34.9
# ℹ 7 more variables: Internet_sd <dbl>, Internet_min <dbl>,
#   Internet_max <dbl>, total_observations <int>, gdp_available <int>,
#   internet_available <int>, both_available <int>

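Insight: GDP per capita is strongly right-skewed (mean of about 19,183 against a median of about 6,437), driven by a few very high-income countries. Internet usage spans almost the full range, from under 0.01% to 99.7% of the population.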
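
The code behind the "Detailed Statistics" tibble is not shown, so the following is only a sketch of how such a one-row summary could be built with `summarise()`. The toy values are made up; the output column names are borrowed from the printed tibble.

```r
library(dplyr)

# Made-up values standing in for analysis_data.
toy <- tibble::tibble(
  GDP_per_capita = c(500, 1500, 6000, 40000),
  Internet_users_pct = c(2, NA, 35, 90)
)

detailed_stats <- toy %>%
  summarise(
    GDP_mean = mean(GDP_per_capita, na.rm = TRUE),
    GDP_median = median(GDP_per_capita, na.rm = TRUE),
    Internet_mean = mean(Internet_users_pct, na.rm = TRUE),
    total_observations = n(),
    gdp_available = sum(!is.na(GDP_per_capita)),
    internet_available = sum(!is.na(Internet_users_pct)),
    # Rows usable for any GDP-vs-internet comparison.
    both_available = sum(!is.na(GDP_per_capita) & !is.na(Internet_users_pct))
  )

print(detailed_stats)
```

A mean well above the median, as in this toy table, is the same signature of right skew that the real summary shows.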

2.1.2 Data Availability by Year

data_by_year <- analysis_data %>%
  group_by(Year) %>%
  summarise(
    countries = n_distinct(`Country Name`),
    gdp_available = sum(!is.na(GDP_per_capita)),
    internet_available = sum(!is.na(Internet_users_pct)),
    both_available = sum(!is.na(GDP_per_capita) & !is.na(Internet_users_pct)),
    .groups = "drop"
  )

print("Data Availability by Year:")
print(data_by_year)

Insight: Data availability improves after 2005, allowing reliable comparisons across countries.

2.1.3 Correlation Analysis

Correlation between GDP per capita and Internet usage: 0.645 

Insight: A strong positive correlation (about 0.645) shows that richer countries tend to have higher internet usage.
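
The correlation code is not shown in the report, so this is a hedged sketch of one way to compute a Pearson correlation while skipping rows with missing values. The toy numbers are made up. The log-scale variant is an addition for illustration only, on the assumption that it may suit a right-skewed GDP variable; it is not something the report itself computes.

```r
# Made-up values; one row has a missing internet figure.
toy <- data.frame(
  GDP_per_capita = c(500, 1500, 6000, 20000, 40000),
  Internet_users_pct = c(5, NA, 40, 70, 90)
)

# use = "complete.obs" drops any pair where either value is NA.
r_raw <- cor(toy$GDP_per_capita, toy$Internet_users_pct, use = "complete.obs")

# Same correlation with GDP on a log10 scale.
r_log <- cor(log10(toy$GDP_per_capita), toy$Internet_users_pct, use = "complete.obs")

cat("Pearson r (raw GDP):  ", round(r_raw, 3), "\n")
cat("Pearson r (log10 GDP):", round(r_log, 3), "\n")
```

Without `use = "complete.obs"`, `cor()` returns `NA` as soon as either column contains a missing value.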

2.1.4 Visualizations

2.1.4.1 GDP per Capita Distribution

Insight: Most countries cluster at lower GDP levels; a few countries are extreme high-income outliers.

2.1.4.2 Internet Usage Distribution

Insight: Internet usage varies widely; adoption has increased over time, especially in developing countries.

2.1.4.3 GDP vs Internet Usage Scatter Plot

Insight: Positive trend confirms richer countries generally have higher internet penetration, with some notable outliers.
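
The plotting code is not included in the report, so the following is a sketch of a scatter plot of this kind rather than the original figure. The toy values, the log-scaled x axis, and the labels are assumptions made for illustration.

```r
library(ggplot2)

# Made-up values standing in for analysis_data.
toy <- data.frame(
  GDP_per_capita = c(500, 1500, 6000, 20000, 40000),
  Internet_users_pct = c(5, 12, 40, 70, 90)
)

p <- ggplot(toy, aes(x = GDP_per_capita, y = Internet_users_pct)) +
  geom_point(alpha = 0.6) +
  # Stating the formula explicitly avoids the informational message
  # that geom_smooth() prints when it has to pick a default.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  # A log scale spreads out the crowded low-GDP end of the axis.
  scale_x_log10(labels = scales::label_comma()) +
  labs(
    x = "GDP per capita (current US$, log scale)",
    y = "Individuals using the Internet (% of population)"
  )

print(p)
```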

2.1.4.4 Time Series Analysis of Global Internet Usage

Insight: Internet usage has steadily increased globally, reflecting digital adoption trends across nations.
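
As a sketch of how a yearly trend line like this could be produced, the example below averages internet usage by year and draws it with `geom_line()`. The data are made up, and the unweighted mean across countries is an assumption; the report does not say whether its series is population-weighted.

```r
library(dplyr)
library(ggplot2)

# Made-up country-year values.
toy <- tibble::tibble(
  Year = c(2000, 2000, 2001, 2001),
  Internet_users_pct = c(10, 20, NA, 30)
)

yearly <- toy %>%
  group_by(Year) %>%
  # Unweighted mean across countries; missing values are skipped.
  summarise(mean_internet = mean(Internet_users_pct, na.rm = TRUE), .groups = "drop")

p <- ggplot(yearly, aes(x = Year, y = mean_internet)) +
  # linewidth is the line-thickness argument in ggplot2 3.4.0 and later.
  geom_line(linewidth = 1) +
  labs(x = "Year", y = "Mean internet usage (% of population)")

print(p)
```

Because each country counts equally in an unweighted mean, small states influence the line as much as large ones, which is worth keeping in mind when reading it as a "global" figure.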

2.1.4.5 Internet Access by Economic Development Level

Insight: Higher GDP quartiles consistently have higher internet usage; some low-income countries outperform expectations.
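
One way to form GDP quartiles and compare internet usage across them is sketched below. The toy data are made up, and the use of `ntile()` and the median as the summary are assumptions, since the report does not show how its quartiles were built.

```r
library(dplyr)

# Made-up values, already sorted by GDP for readability.
toy <- tibble::tibble(
  GDP_per_capita = c(300, 900, 2000, 5000, 12000, 30000, 45000, 80000),
  Internet_users_pct = c(3, 8, 20, 35, 55, 70, 85, 95)
)

by_quartile <- toy %>%
  # ntile() assigns rank-based groups 1 (lowest GDP) to 4 (highest GDP).
  mutate(GDP_quartile = ntile(GDP_per_capita, 4)) %>%
  group_by(GDP_quartile) %>%
  summarise(
    median_internet = median(Internet_users_pct, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

print(by_quartile)
```

Quartiles computed over all country-years pooled together mix time and income effects; grouping by `Year` before calling `ntile()` would rank countries within each year instead.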

2.1.4.6 Outlier Analysis

[1] "Countries with High GDP but Low Internet Usage:"
# A tibble: 46 × 4
   `Country Name`  Year GDP_per_capita Internet_users_pct
   <chr>          <dbl>          <dbl>              <dbl>
 1 Monaco          2003        111310.               49.5
 2 Monaco          2002         90152.               48.0
 3 Monaco          2001         82410.               46.6
 4 Monaco          2000         81764.               42.2
 5 San Marino      2011         55815.               49.6
 6 Italy           2008         40945.               44.5
 7 Japan           2000         39169.               30.0
 8 Italy           2007         37871.               40.8
 9 San Marino      2000         37474.               48.8
10 Italy           2009         37227.               48.8
# ℹ 36 more rows
[1] "Countries with Low GDP but High Internet Usage:"
# A tibble: 25 × 4
   `Country Name`  Year GDP_per_capita Internet_users_pct
   <chr>          <dbl>          <dbl>              <dbl>
 1 Fiji            2021          4656.               87.7
 2 Fiji            2020          4816.               84.9
 3 Ukraine         2021          4556.               79.2
 4 Uzbekistan      2021          1993.               76.6
 5 Ukraine         2020          3543.               75.0
 6 Uzbekistan      2020          1759.               71.1
 7 Algeria         2021          4216.               70.8
 8 Uzbekistan      2019          1795.               70.4
 9 Ukraine         2019          3460.               70.1
10 Indonesia       2022          4788.               66.5
# ℹ 15 more rows

Insight: Outliers suggest countries with unique economic or technological circumstances; these may be of interest for further study.
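
The filters that produced the two outlier tables are not shown. The sketch below illustrates the general approach with threshold-based filters; all four cutoffs and the toy rows are hypothetical and are not the values used in the report.

```r
library(dplyr)

# Made-up country-year rows.
toy <- tibble::tibble(
  `Country Name` = c("A", "B", "C", "D"),
  Year = c(2005, 2005, 2020, 2020),
  GDP_per_capita = c(60000, 50000, 2000, 3000),
  Internet_users_pct = c(45, 95, 80, 10)
)

# Hypothetical cutoffs, chosen only for this illustration.
high_gdp <- 30000
low_gdp <- 5000
low_internet <- 50
high_internet <- 60

high_gdp_low_internet <- toy %>%
  filter(GDP_per_capita > high_gdp, Internet_users_pct < low_internet) %>%
  arrange(desc(GDP_per_capita))

low_gdp_high_internet <- toy %>%
  filter(GDP_per_capita < low_gdp, Internet_users_pct > high_internet) %>%
  arrange(desc(Internet_users_pct))

print(high_gdp_low_internet)
print(low_gdp_high_internet)
```

Fixed cutoffs applied across all years tend to flag early-2000s observations from rich countries, when internet usage was lower everywhere; year-specific thresholds would separate genuine laggards from that time effect.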

2.1.4.8 Key Insights Summary


=== KEY INSIGHTS FROM EXPLORATORY ANALYSIS ===
1. Data Coverage: 1124 total observations
2. GDP-Internet Correlation: 0.645 
3. Countries with complete data: 47 
4. Year range: 2000 to 2023 
5. Average global internet usage has strong correlation with economic development

Summary:

  • Economic development strongly correlates with digital access.

  • Global internet adoption is steadily increasing.

  • Outliers and trends provide insight into countries that perform above or below expectations.

3 Key Insights and Implications

  1. Economic and Digital Development: Wealthier countries generally have higher internet usage, consistent with the correlation of about 0.645 between GDP per capita and the share of the population online.

  2. Global Trends: Internet adoption has increased globally, with notable improvements in developing regions over the last decade.

  3. Outliers: Some countries deviate from expected patterns, providing opportunities for targeted policy interventions or further research.

  4. Education and Development: While not the primary focus here, preliminary analysis shows countries with low internet usage also often have higher rates of adolescents out of school, suggesting overlapping development challenges.

Limitations:

  • Missing data for some countries, especially in early years, may bias trends.

  • Indicators are national averages and may mask regional disparities.

Future Research:

  • Examine subnational data to capture regional inequalities.

  • Explore causal relationships between GDP growth and digital adoption.

  • Investigate other indicators such as life expectancy, literacy, or healthcare access in conjunction with digital access.