library(tidyverse)Global Country Information
Introduction
This project develops practical competency in transforming wide-format datasets into tidy formats suitable for downstream analysis. All transformations and data manipulations are performed using the `tidyr` and `dplyr` packages in R, with visualizations built using `ggplot2`.
Dataset used:
Global Country Information Dataset 2023: www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023
Approach
This dataset contains 195 rows, representing individual countries, and 35 columns that capture economic, health, and demographic indicators. Variables such as GDP, life expectancy, infant mortality, CO₂ emissions, birth rate, and physicians per thousand people are each stored as separate columns. This structure makes the dataset a clear example of wide format, where one entity appears per row and many measurements are spread across multiple columns.
Before reshaping the dataset, several data-cleaning steps were necessary. Some numeric variables were stored as text because they included formatting symbols such as dollar signs, commas, or percent signs. In addition, several column names contained special characters, spaces, or line breaks that made them difficult to work with in R. To address these issues, column names were standardized using string replacement functions from the stringr package, and formatted numeric values were converted into proper numeric variables using the parse_number function.
After cleaning the dataset, the data was transformed from wide format into tidy format using the pivot_longer function from the tidyr package. In the resulting tidy dataset, each row represents a single indicator measurement for a specific country, stored in two columns: indicator and value. This structure makes the data easier to filter, reshape, and analyze.
Once the dataset was converted to tidy format, the analysis focused on examining the relationship between GDP and life expectancy across countries. Life expectancy was selected as the primary outcome variable, while GDP was used as an indicator of economic development. The tidy dataset was filtered to retain only these two variables, and a scatter plot was created to visualize the relationship between them. Because GDP values vary widely between countries, the GDP axis was displayed on a logarithmic scale to improve readability and reveal patterns more clearly.
This approach allows us to explore whether countries with larger economies tend to exhibit higher life expectancy, while also illustrating how tidy data principles make it easier to investigate relationships between global development indicators.
Base Code
# Raw Data Import from my GitHub repository
raw_data <- read_csv(
"https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/Project2/world-data-2023.csv")Rows: 195 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): Country, Abbreviation, Agricultural Land( %), Capital/Major City, ...
dbl (9): Birth Rate, Calling Code, Fertility Rate, Infant mortality, Life e...
num (7): Density
(P/Km2), Land Area(Km2), Armed Forces size, Co2-Emissions,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(raw_data)Rows: 195
Columns: 35
$ Country <chr> "Afghanistan", "Albania", …
$ `Density\n(P/Km2)` <dbl> 60, 105, 18, 164, 26, 223,…
$ Abbreviation <chr> "AF", "AL", "DZ", "AD", "A…
$ `Agricultural Land( %)` <chr> "58.10%", "43.10%", "17.40…
$ `Land Area(Km2)` <dbl> 652230, 28748, 2381741, 46…
$ `Armed Forces size` <dbl> 323000, 9000, 317000, NA, …
$ `Birth Rate` <dbl> 32.49, 11.78, 24.28, 7.20,…
$ `Calling Code` <dbl> 93, 355, 213, 376, 244, 1,…
$ `Capital/Major City` <chr> "Kabul", "Tirana", "Algier…
$ `Co2-Emissions` <dbl> 8672, 4536, 150006, 469, 3…
$ CPI <dbl> 149.90, 119.05, 151.36, NA…
$ `CPI Change (%)` <chr> "2.30%", "1.40%", "2.00%",…
$ `Currency-Code` <chr> "AFN", "ALL", "DZD", "EUR"…
$ `Fertility Rate` <dbl> 4.47, 1.62, 3.02, 1.27, 5.…
$ `Forested Area (%)` <chr> "2.10%", "28.10%", "0.80%"…
$ `Gasoline Price` <chr> "$0.70", "$1.36", "$0.28",…
$ GDP <chr> "$19,101,353,833", "$15,27…
$ `Gross primary education enrollment (%)` <chr> "104.00%", "107.00%", "109…
$ `Gross tertiary education enrollment (%)` <chr> "9.70%", "55.00%", "51.40%…
$ `Infant mortality` <dbl> 47.9, 7.8, 20.1, 2.7, 51.6…
$ `Largest city` <chr> "Kabul", "Tirana", "Algier…
$ `Life expectancy` <dbl> 64.5, 78.5, 76.7, NA, 60.8…
$ `Maternal mortality ratio` <dbl> 638, 15, 112, NA, 241, 42,…
$ `Minimum wage` <chr> "$0.43", "$1.12", "$0.95",…
$ `Official language` <chr> "Pashto", "Albanian", "Ara…
$ `Out of pocket health expenditure` <chr> "78.40%", "56.90%", "28.10…
$ `Physicians per thousand` <dbl> 0.28, 1.20, 1.72, 3.33, 0.…
$ Population <dbl> 38041754, 2854191, 4305305…
$ `Population: Labor force participation (%)` <chr> "48.90%", "55.70%", "41.20…
$ `Tax revenue (%)` <chr> "9.30%", "18.60%", "37.20%…
$ `Total tax rate` <chr> "71.40%", "36.60%", "66.10…
$ `Unemployment rate` <chr> "11.12%", "12.33%", "11.70…
$ Urban_population <dbl> 9797273, 1747593, 31510100…
$ Latitude <dbl> 33.939110, 41.153332, 28.0…
$ Longitude <dbl> 67.709953, 20.168331, 1.65…
head(raw_data)# A tibble: 6 × 35
Country `Density\n(P/Km2)` Abbreviation `Agricultural Land( %)`
<chr> <dbl> <chr> <chr>
1 Afghanistan 60 AF 58.10%
2 Albania 105 AL 43.10%
3 Algeria 18 DZ 17.40%
4 Andorra 164 AD 40.00%
5 Angola 26 AO 47.50%
6 Antigua and Barbuda 223 AG 20.50%
# ℹ 31 more variables: `Land Area(Km2)` <dbl>, `Armed Forces size` <dbl>,
# `Birth Rate` <dbl>, `Calling Code` <dbl>, `Capital/Major City` <chr>,
# `Co2-Emissions` <dbl>, CPI <dbl>, `CPI Change (%)` <chr>,
# `Currency-Code` <chr>, `Fertility Rate` <dbl>, `Forested Area (%)` <chr>,
# `Gasoline Price` <chr>, GDP <chr>,
# `Gross primary education enrollment (%)` <chr>,
# `Gross tertiary education enrollment (%)` <chr>, …
Data Cleaning and Tidying
# Clean column names
names(raw_data) <- names(raw_data) |>
str_replace_all("\n", "_") |>
str_replace_all("[ %()/:-]", "_") |>
str_replace_all("_+", "_") |>
str_replace_all("_$", "") |>
str_to_lower()
# Convert formatted text columns to numeric
countries_clean <- raw_data |>
mutate(
agricultural_land = parse_number(agricultural_land),
cpi_change = parse_number(cpi_change),
forested_area = parse_number(forested_area),
gasoline_price = parse_number(gasoline_price),
gdp = parse_number(gdp),
gross_primary_education_enrollment = parse_number(gross_primary_education_enrollment),
gross_tertiary_education_enrollment = parse_number(gross_tertiary_education_enrollment),
minimum_wage = parse_number(minimum_wage),
out_of_pocket_health_expenditure = parse_number(out_of_pocket_health_expenditure),
population_labor_force_participation = parse_number(population_labor_force_participation),
tax_revenue = parse_number(tax_revenue),
total_tax_rate = parse_number(total_tax_rate),
unemployment_rate = parse_number(unemployment_rate)
)
# Reshape from wide to long
countries_tidy <- countries_clean |>
pivot_longer(
cols = -c(
country,
abbreviation,
capital_major_city,
largest_city,
official_language,
currency_code,
latitude,
longitude
),
names_to = "indicator",
values_to = "value"
)
# Check the tidy dataset
glimpse(countries_tidy)Rows: 5,265
Columns: 10
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ abbreviation <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF…
$ capital_major_city <chr> "Kabul", "Kabul", "Kabul", "Kabul", "Kabul", "Kabul…
$ currency_code <chr> "AFN", "AFN", "AFN", "AFN", "AFN", "AFN", "AFN", "A…
$ largest_city <chr> "Kabul", "Kabul", "Kabul", "Kabul", "Kabul", "Kabul…
$ official_language <chr> "Pashto", "Pashto", "Pashto", "Pashto", "Pashto", "…
$ latitude <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3…
$ longitude <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6…
$ indicator <chr> "density_p_km2", "agricultural_land", "land_area_km…
$ value <dbl> 6.000000e+01, 5.810000e+01, 6.522300e+05, 3.230000e…
head(countries_tidy)# A tibble: 6 × 10
country abbreviation capital_major_city currency_code largest_city
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AF Kabul AFN Kabul
2 Afghanistan AF Kabul AFN Kabul
3 Afghanistan AF Kabul AFN Kabul
4 Afghanistan AF Kabul AFN Kabul
5 Afghanistan AF Kabul AFN Kabul
6 Afghanistan AF Kabul AFN Kabul
# ℹ 5 more variables: official_language <chr>, latitude <dbl>, longitude <dbl>,
# indicator <chr>, value <dbl>
nrow(countries_tidy)[1] 5265
Analysis: Examining the Relationship Between GDP and Life Expectancy
# Prep dataset for GDP vs Life Expectancy comparison
analysis_data <- countries_tidy |>
filter(indicator %in% c("life_expectancy", "gdp")) |> # keep only relevant indicators
pivot_wider(names_from = indicator, values_from = value) |> # reshape for comparison
drop_na(gdp, life_expectancy) # remove missing values
# Select top 10 countries by GDP for labels
top_gdp <- analysis_data |>
arrange(desc(gdp)) |>
slice_head(n = 10)
ggplot(analysis_data, aes(x = gdp, y = life_expectancy, color = life_expectancy)) +
geom_point(size = 2.5, alpha = 0.8) +
scale_x_log10(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
scale_color_viridis_c() +
labs(
title = " Relationship Between GDP and Life Expectancy
",
x = "GDP",
y = "Life Expectancy",
color = "Life Expectancy"
) +
theme_minimal()Conclusion
In this project, a wide-format global country dataset was cleaned and transformed into tidy format to enable meaningful analysis. The original dataset contained multiple indicators stored across many columns, as well as several numeric values recorded as formatted text containing symbols such as percent signs, dollar signs, and commas. These inconsistencies were addressed by standardizing column names and converting formatted values into numeric variables. The dataset was then reshaped from wide format into long tidy format using pivot_longer(), resulting in a structure where each row represents one indicator measurement for a specific country.
Once the data was tidied, exploratory analysis focused on examining the relationship between GDP and life expectancy across countries. A scatter plot was used to visualize the association between these variables. Because GDP values vary dramatically across countries, the GDP axis was displayed on a logarithmic scale to improve readability and better reveal patterns in the data.
The results suggest a generally positive relationship between GDP and life expectancy. Countries with larger economies tend to exhibit higher life expectancy values, although the relationship is not perfectly linear. Some countries with moderate GDP levels still achieve relatively high life expectancy, indicating that additional factors such as healthcare systems, social policies, and public health conditions also influence population health outcomes.
Overall, the analysis demonstrates how transforming data into tidy format enables flexible exploration of relationships between variables. By restructuring the dataset and addressing inconsistencies in the original data, the indicators could be easily filtered, reshaped, and visualized to investigate meaningful global patterns.