Global Country Information

Author

Madina Kudanova

Introduction

This project develops practical competency in transforming wide-format datasets into tidy formats suitable for downstream analysis. All transformations and data manipulations are performed using the `tidyr` and `dplyr` packages in R, with visualizations built using `ggplot2`.

Dataset used:

Global Country Information Dataset 2023: www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023

Approach

This dataset contains 195 rows, representing individual countries, and 35 columns that capture economic, health, and demographic indicators. Variables such as GDP, life expectancy, infant mortality, CO₂ emissions, birth rate, and physicians per thousand people are each stored as separate columns. This structure makes the dataset a clear example of wide format, where one entity appears per row and many measurements are spread across multiple columns.

Before reshaping the dataset, several data-cleaning steps were necessary. Some numeric variables were stored as text because they included formatting symbols such as dollar signs, commas, or percent signs. In addition, several column names contained special characters, spaces, or line breaks that made them difficult to work with in R. To address these issues, column names were standardized using string replacement functions from the stringr package, and formatted numeric values were converted into proper numeric variables using the parse_number function.

After cleaning the dataset, the data was transformed from wide format into tidy format using the pivot_longer function from the tidyr package. In the resulting tidy dataset, each row represents a single indicator measurement for a specific country, stored in two columns: indicator and value. This structure makes the data easier to filter, reshape, and analyze.

Once the dataset was converted to tidy format, the analysis focused on examining the relationship between GDP and life expectancy across countries. Life expectancy was selected as the primary outcome variable, while GDP was used as an indicator of economic development. The tidy dataset was filtered to retain only these two variables, and a scatter plot was created to visualize the relationship between them. Because GDP values vary widely between countries, the GDP axis was displayed on a logarithmic scale to improve readability and reveal patterns more clearly.

This approach allows us to explore whether countries with larger economies tend to exhibit higher life expectancy, while also illustrating how tidy data principles make it easier to investigate relationships between global development indicators.

Base Code

library(tidyverse)
# Raw Data Import from my GitHub repository 
raw_data <- read_csv(
  "https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/Project2/world-data-2023.csv")
Rows: 195 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): Country, Abbreviation, Agricultural Land( %), Capital/Major City, ...
dbl  (9): Birth Rate, Calling Code, Fertility Rate, Infant mortality, Life e...
num  (7): Density
(P/Km2), Land Area(Km2), Armed Forces size, Co2-Emissions,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(raw_data)
Rows: 195
Columns: 35
$ Country                                     <chr> "Afghanistan", "Albania", …
$ `Density\n(P/Km2)`                          <dbl> 60, 105, 18, 164, 26, 223,…
$ Abbreviation                                <chr> "AF", "AL", "DZ", "AD", "A…
$ `Agricultural Land( %)`                     <chr> "58.10%", "43.10%", "17.40…
$ `Land Area(Km2)`                            <dbl> 652230, 28748, 2381741, 46…
$ `Armed Forces size`                         <dbl> 323000, 9000, 317000, NA, …
$ `Birth Rate`                                <dbl> 32.49, 11.78, 24.28, 7.20,…
$ `Calling Code`                              <dbl> 93, 355, 213, 376, 244, 1,…
$ `Capital/Major City`                        <chr> "Kabul", "Tirana", "Algier…
$ `Co2-Emissions`                             <dbl> 8672, 4536, 150006, 469, 3…
$ CPI                                         <dbl> 149.90, 119.05, 151.36, NA…
$ `CPI Change (%)`                            <chr> "2.30%", "1.40%", "2.00%",…
$ `Currency-Code`                             <chr> "AFN", "ALL", "DZD", "EUR"…
$ `Fertility Rate`                            <dbl> 4.47, 1.62, 3.02, 1.27, 5.…
$ `Forested Area (%)`                         <chr> "2.10%", "28.10%", "0.80%"…
$ `Gasoline Price`                            <chr> "$0.70", "$1.36", "$0.28",…
$ GDP                                         <chr> "$19,101,353,833", "$15,27…
$ `Gross primary education enrollment (%)`    <chr> "104.00%", "107.00%", "109…
$ `Gross tertiary education enrollment (%)`   <chr> "9.70%", "55.00%", "51.40%…
$ `Infant mortality`                          <dbl> 47.9, 7.8, 20.1, 2.7, 51.6…
$ `Largest city`                              <chr> "Kabul", "Tirana", "Algier…
$ `Life expectancy`                           <dbl> 64.5, 78.5, 76.7, NA, 60.8…
$ `Maternal mortality ratio`                  <dbl> 638, 15, 112, NA, 241, 42,…
$ `Minimum wage`                              <chr> "$0.43", "$1.12", "$0.95",…
$ `Official language`                         <chr> "Pashto", "Albanian", "Ara…
$ `Out of pocket health expenditure`          <chr> "78.40%", "56.90%", "28.10…
$ `Physicians per thousand`                   <dbl> 0.28, 1.20, 1.72, 3.33, 0.…
$ Population                                  <dbl> 38041754, 2854191, 4305305…
$ `Population: Labor force participation (%)` <chr> "48.90%", "55.70%", "41.20…
$ `Tax revenue (%)`                           <chr> "9.30%", "18.60%", "37.20%…
$ `Total tax rate`                            <chr> "71.40%", "36.60%", "66.10…
$ `Unemployment rate`                         <chr> "11.12%", "12.33%", "11.70…
$ Urban_population                            <dbl> 9797273, 1747593, 31510100…
$ Latitude                                    <dbl> 33.939110, 41.153332, 28.0…
$ Longitude                                   <dbl> 67.709953, 20.168331, 1.65…
head(raw_data)
# A tibble: 6 × 35
  Country             `Density\n(P/Km2)` Abbreviation `Agricultural Land( %)`
  <chr>                            <dbl> <chr>        <chr>                  
1 Afghanistan                         60 AF           58.10%                 
2 Albania                            105 AL           43.10%                 
3 Algeria                             18 DZ           17.40%                 
4 Andorra                            164 AD           40.00%                 
5 Angola                              26 AO           47.50%                 
6 Antigua and Barbuda                223 AG           20.50%                 
# ℹ 31 more variables: `Land Area(Km2)` <dbl>, `Armed Forces size` <dbl>,
#   `Birth Rate` <dbl>, `Calling Code` <dbl>, `Capital/Major City` <chr>,
#   `Co2-Emissions` <dbl>, CPI <dbl>, `CPI Change (%)` <chr>,
#   `Currency-Code` <chr>, `Fertility Rate` <dbl>, `Forested Area (%)` <chr>,
#   `Gasoline Price` <chr>, GDP <chr>,
#   `Gross primary education enrollment (%)` <chr>,
#   `Gross tertiary education enrollment (%)` <chr>, …

Data Cleaning and Tidying

# Clean column names
names(raw_data) <- names(raw_data) |>
  str_replace_all("\n", "_") |>
  str_replace_all("[ %()/:-]", "_") |>
  str_replace_all("_+", "_") |>
  str_replace_all("_$", "") |>
  str_to_lower()

# Convert formatted text columns to numeric
countries_clean <- raw_data |>
  mutate(
    agricultural_land = parse_number(agricultural_land),
    cpi_change = parse_number(cpi_change),
    forested_area = parse_number(forested_area),
    gasoline_price = parse_number(gasoline_price),
    gdp = parse_number(gdp),
    gross_primary_education_enrollment = parse_number(gross_primary_education_enrollment),
    gross_tertiary_education_enrollment = parse_number(gross_tertiary_education_enrollment),
    minimum_wage = parse_number(minimum_wage),
    out_of_pocket_health_expenditure = parse_number(out_of_pocket_health_expenditure),
    population_labor_force_participation = parse_number(population_labor_force_participation),
    tax_revenue = parse_number(tax_revenue),
    total_tax_rate = parse_number(total_tax_rate),
    unemployment_rate = parse_number(unemployment_rate)
  )

# Reshape from wide to long
countries_tidy <- countries_clean |>
  pivot_longer(
    cols = -c(
      country,
      abbreviation,
      capital_major_city,
      largest_city,
      official_language,
      currency_code,
      latitude,
      longitude
    ),
    names_to = "indicator",
    values_to = "value"
  )

# Check the tidy dataset
glimpse(countries_tidy)
Rows: 5,265
Columns: 10
$ country            <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ abbreviation       <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF…
$ capital_major_city <chr> "Kabul", "Kabul", "Kabul", "Kabul", "Kabul", "Kabul…
$ currency_code      <chr> "AFN", "AFN", "AFN", "AFN", "AFN", "AFN", "AFN", "A…
$ largest_city       <chr> "Kabul", "Kabul", "Kabul", "Kabul", "Kabul", "Kabul…
$ official_language  <chr> "Pashto", "Pashto", "Pashto", "Pashto", "Pashto", "…
$ latitude           <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3…
$ longitude          <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6…
$ indicator          <chr> "density_p_km2", "agricultural_land", "land_area_km…
$ value              <dbl> 6.000000e+01, 5.810000e+01, 6.522300e+05, 3.230000e…
head(countries_tidy)
# A tibble: 6 × 10
  country     abbreviation capital_major_city currency_code largest_city
  <chr>       <chr>        <chr>              <chr>         <chr>       
1 Afghanistan AF           Kabul              AFN           Kabul       
2 Afghanistan AF           Kabul              AFN           Kabul       
3 Afghanistan AF           Kabul              AFN           Kabul       
4 Afghanistan AF           Kabul              AFN           Kabul       
5 Afghanistan AF           Kabul              AFN           Kabul       
6 Afghanistan AF           Kabul              AFN           Kabul       
# ℹ 5 more variables: official_language <chr>, latitude <dbl>, longitude <dbl>,
#   indicator <chr>, value <dbl>
nrow(countries_tidy)
[1] 5265

Analysis: Examining the Relationship Between GDP and Life Expectancy

# Prep dataset for GDP vs Life Expectancy comparison
analysis_data <- countries_tidy |>
  filter(indicator %in% c("life_expectancy", "gdp")) |>   # keep only relevant indicators
  pivot_wider(names_from = indicator, values_from = value) |>  # reshape for comparison
  drop_na(gdp, life_expectancy)   # remove missing values

# Select top 10 countries by GDP for labels
top_gdp <- analysis_data |>
  arrange(desc(gdp)) |>
  slice_head(n = 10)

ggplot(analysis_data, aes(x = gdp, y = life_expectancy, color = life_expectancy)) +
  geom_point(size = 2.5, alpha = 0.8) +
  scale_x_log10(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  scale_color_viridis_c() +
  labs(
    title = "                Relationship Between GDP and Life Expectancy
        ",
    x = "GDP",
    y = "Life Expectancy",
    color = "Life Expectancy"
  ) +
  theme_minimal()

Conclusion

In this project, a wide-format global country dataset was cleaned and transformed into tidy format to enable meaningful analysis. The original dataset contained multiple indicators stored across many columns, as well as several numeric values recorded as formatted text containing symbols such as percent signs, dollar signs, and commas. These inconsistencies were addressed by standardizing column names and converting formatted values into numeric variables. The dataset was then reshaped from wide format into long tidy format using pivot_longer(), resulting in a structure where each row represents one indicator measurement for a specific country.

Once the data was tidied, exploratory analysis focused on examining the relationship between GDP and life expectancy across countries. A scatter plot was used to visualize the association between these variables. Because GDP values vary dramatically across countries, the GDP axis was displayed on a logarithmic scale to improve readability and better reveal patterns in the data.

The results suggest a generally positive relationship between GDP and life expectancy. Countries with larger economies tend to exhibit higher life expectancy values, although the relationship is not perfectly linear. Some countries with moderate GDP levels still achieve relatively high life expectancy, indicating that additional factors such as healthcare systems, social policies, and public health conditions also influence population health outcomes.

Overall, the analysis demonstrates how transforming data into tidy format enables flexible exploration of relationships between variables. By restructuring the dataset and addressing inconsistencies in the original data, the indicators could be easily filtered, reshaped, and visualized to investigate meaningful global patterns.