Project 2 – Data Transformations: Codebase

Author

Muhammad Suffyan Khan

Published

March 5, 2026

Introduction

The objective of this project is to develop practical experience transforming wide-format datasets into tidy datasets suitable for analysis. In many real-world datasets, variables are spread across multiple columns, which makes analysis and visualization difficult. The tidy data principles described by Hadley Wickham recommend structuring datasets so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Using the tidyr and dplyr packages in R, this project demonstrates how wide datasets can be transformed into tidy formats through reproducible data transformation pipelines. Each dataset is first preserved in its raw wide format and then converted into a tidy structure that can be used for analysis and visualization.

Three independent datasets from Discussion 5A are used in this project:

  1. COVID-19 World Vaccination Progress
  2. Renewable Energy Capacity Time Series
  3. World GDP by Country (1960–2022)

Each dataset illustrates a different type of transformation scenario and allows meaningful analysis once the data is properly structured.

Dataset 1: COVID-19 World Vaccination Progress

Data Source

Dataset: COVID World Vaccination Progress Source: Kaggle Link: https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress

This dataset contains vaccination statistics for countries around the world during the COVID-19 pandemic. The dataset includes multiple vaccination metrics such as total vaccinations, people vaccinated, people fully vaccinated, and daily vaccination rates.

Structure Before Tidying

The dataset includes several vaccination-related variables stored as separate columns. While the dataset contains useful information, the structure spreads similar measurements across multiple columns instead of representing them as a single variable describing the type of vaccination metric.

This structure is considered partially wide because the vaccination metrics represent the same type of measurement recorded across different variables.

Planned Transformation

The transformation process will reshape the dataset by converting the vaccination metric columns into a single categorical variable describing the metric type, while storing the corresponding values in another column.

This will be achieved using pivot_longer() from the tidyr package.

The transformation will result in a tidy dataset with variables such as:

  • country
  • date
  • metric
  • value

This structure allows easier filtering, grouping, and comparison across vaccination metrics.

Planned Analysis

After tidying the dataset, the analysis will examine vaccination progress across different countries. Visualizations will be created to show how vaccination rates change over time and how countries compare in terms of vaccination coverage.

Potential analysis includes:

  • Comparing vaccination rates between countries
  • Visualizing vaccination progress over time
  • Identifying countries with higher vaccination coverage

Dataset 2: Renewable Energy Capacity Time Series

Data Source

Dataset: Renewable Power Plants / Renewable Capacity Time Series Source: Kaggle Link: https://www.kaggle.com/datasets/eugeniyosetrov/renewable-power-plants

This dataset contains power generation data for multiple renewable energy technologies across several European countries. The data includes values for different energy sources such as solar, wind, hydro, and other renewable technologies.

Structure Before Tidying

The dataset is initially very wide. It includes columns representing different renewable energy sources as well as multiple columns representing different countries. The dataset also contains time series information representing energy generation across time.

Because these variables are stored across multiple columns rather than as values in a single variable column, the dataset is not in tidy format.

Planned Transformation

To tidy this dataset, the wide structure will be converted into a long format where:

  • energy type becomes a categorical variable
  • country becomes a variable
  • generation values are stored in a single column
  • time remains as an index variable

This transformation will allow the dataset to be analyzed more easily using grouping and visualization tools.

The main transformation will again use pivot_longer() to reshape the data.

Planned Analysis

After tidying the data, the following analyses will be performed:

  • Comparing renewable energy production across energy sources
  • Comparing renewable energy output between countries
  • Examining how renewable energy generation changes over time

These analyses will help illustrate differences in renewable energy usage and trends across countries and technologies.

Dataset 3: World GDP by Country (1960–2022)

Data Source

Dataset: World GDP by Country (1960–2022) Source: Kaggle Link: https://www.kaggle.com/datasets/annafabris/world-gdp-by-country-1960-2022

This dataset contains the gross domestic product (GDP) values for many countries across multiple years. Each country has GDP values listed for many different years within the same row.

Structure Before Tidying

The dataset stores each year as a separate column, which results in a wide structure. Instead of having a variable representing the year, the years appear as column names.

For example:

Country 1960 1961 1962 1963
USA value value value value
France value value value value
Japan value value value value

This format violates tidy data principles because the year variable is encoded as column headers rather than as a variable within the dataset.

Planned Transformation

The dataset will be transformed into tidy format by converting the year columns into a single year variable and placing GDP values into a corresponding gdp column.

This will be accomplished using pivot_longer().

The resulting tidy structure will resemble:

country | year | gdp |

This format allows the dataset to be easily analyzed and visualized using standard data analysis tools.

Planned Analysis

Once the dataset is tidy, the analysis will focus on economic trends across countries.

Possible analyses include:

  • Comparing GDP growth between countries
  • Calculating average GDP over time
  • Identifying countries with the fastest economic growth
  • Visualizing GDP trends using time series plots

Reproducibility Plan

All data transformations and analyses will be implemented in a Quarto Markdown file using the tidyr, dplyr, and ggplot2 packages in R.

The workflow will follow these steps for each dataset:

  1. Import the raw wide-format dataset from a CSV file.
  2. Inspect the structure of the raw dataset.
  3. Apply pivot_longer() and other dplyr transformations to create a tidy dataset.
  4. Perform analysis using the tidy dataset.
  5. Create summary tables and visualizations using ggplot2.

All code will be fully reproducible and executable from a clean R session.

Expected Outcome

By completing these transformations, the project will demonstrate how wide-format datasets commonly encountered in real-world data sources can be reshaped into tidy structures suitable for analysis.

The resulting tidy datasets will allow more flexible analysis, easier visualization, and clearer interpretation of trends in vaccination progress, renewable energy generation, and economic growth across countries.

Code Part Starts Here

library(tidyverse)
library(scales)

set.seed(607)

theme_set(theme_minimal())

Dataset 1 — COVID-19 World Vaccination Progress

Data Source

Source: Kaggle, COVID World Vaccination Progress. Raw CSV used in this project:

covid_url <- "https://raw.githubusercontent.com/suffyankhan77/Project2-DATA-607/refs/heads/main/country_vaccinations.csv"

Data Structure Before Tidying

covid_raw <- read_csv(covid_url)

dim(covid_raw)
[1] 86512    15
names(covid_raw)
 [1] "country"                             "iso_code"                           
 [3] "date"                                "total_vaccinations"                 
 [5] "people_vaccinated"                   "people_fully_vaccinated"            
 [7] "daily_vaccinations_raw"              "daily_vaccinations"                 
 [9] "total_vaccinations_per_hundred"      "people_vaccinated_per_hundred"      
[11] "people_fully_vaccinated_per_hundred" "daily_vaccinations_per_million"     
[13] "vaccines"                            "source_name"                        
[15] "source_website"                     
glimpse(covid_raw)
Rows: 86,512
Columns: 15
$ country                             <chr> "Afghanistan", "Afghanistan", "Afg…
$ iso_code                            <chr> "AFG", "AFG", "AFG", "AFG", "AFG",…
$ date                                <date> 2021-02-22, 2021-02-23, 2021-02-2…
$ total_vaccinations                  <dbl> 0, NA, NA, NA, NA, NA, 8200, NA, N…
$ people_vaccinated                   <dbl> 0, NA, NA, NA, NA, NA, 8200, NA, N…
$ people_fully_vaccinated             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ daily_vaccinations_raw              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ daily_vaccinations                  <dbl> NA, 1367, 1367, 1367, 1367, 1367, …
$ total_vaccinations_per_hundred      <dbl> 0.00, NA, NA, NA, NA, NA, 0.02, NA…
$ people_vaccinated_per_hundred       <dbl> 0.00, NA, NA, NA, NA, NA, 0.02, NA…
$ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ daily_vaccinations_per_million      <dbl> NA, 34, 34, 34, 34, 34, 34, 40, 45…
$ vaccines                            <chr> "Johnson&Johnson, Oxford/AstraZene…
$ source_name                         <chr> "World Health Organization", "Worl…
$ source_website                      <chr> "https://covid19.who.int/", "https…
covid_raw %>% slice_head(n = 10)
# A tibble: 10 × 15
   country     iso_code date       total_vaccinations people_vaccinated
   <chr>       <chr>    <date>                  <dbl>             <dbl>
 1 Afghanistan AFG      2021-02-22                  0                 0
 2 Afghanistan AFG      2021-02-23                 NA                NA
 3 Afghanistan AFG      2021-02-24                 NA                NA
 4 Afghanistan AFG      2021-02-25                 NA                NA
 5 Afghanistan AFG      2021-02-26                 NA                NA
 6 Afghanistan AFG      2021-02-27                 NA                NA
 7 Afghanistan AFG      2021-02-28               8200              8200
 8 Afghanistan AFG      2021-03-01                 NA                NA
 9 Afghanistan AFG      2021-03-02                 NA                NA
10 Afghanistan AFG      2021-03-03                 NA                NA
# ℹ 10 more variables: people_fully_vaccinated <dbl>,
#   daily_vaccinations_raw <dbl>, daily_vaccinations <dbl>,
#   total_vaccinations_per_hundred <dbl>, people_vaccinated_per_hundred <dbl>,
#   people_fully_vaccinated_per_hundred <dbl>,
#   daily_vaccinations_per_million <dbl>, vaccines <chr>, source_name <chr>,
#   source_website <chr>

The original dataset contains several vaccination measures stored in separate columns, such as total_vaccinations, people_vaccinated, people_fully_vaccinated, and related rate variables. This makes the dataset partially wide because similar measurements are spread across multiple columns instead of being represented as values in a single variable. To normalize this structure, these related measurement columns are reshaped into two variables: metric and value.

Data Import and Tidying

covid_tidy <- covid_raw %>%
  mutate(date = as.Date(date)) %>%
  pivot_longer(
    cols = c(
      total_vaccinations,
      people_vaccinated,
      people_fully_vaccinated,
      daily_vaccinations_raw,
      daily_vaccinations,
      total_vaccinations_per_hundred,
      people_vaccinated_per_hundred,
      people_fully_vaccinated_per_hundred,
      daily_vaccinations_per_million
    ),
    names_to = "metric",
    values_to = "value"
  ) %>%
  rename_with(tolower) %>%
  mutate(metric = str_replace_all(metric, "_", " ")) %>%
  filter(!is.na(value))

glimpse(covid_tidy)
Rows: 455,194
Columns: 8
$ country        <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
$ iso_code       <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
$ date           <date> 2021-02-22, 2021-02-22, 2021-02-22, 2021-02-22, 2021-0…
$ vaccines       <chr> "Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, …
$ source_name    <chr> "World Health Organization", "World Health Organization…
$ source_website <chr> "https://covid19.who.int/", "https://covid19.who.int/",…
$ metric         <chr> "total vaccinations", "people vaccinated", "total vacci…
$ value          <dbl> 0.00, 0.00, 0.00, 0.00, 1367.00, 34.00, 1367.00, 34.00,…
covid_tidy %>% slice_head(n = 10)
# A tibble: 10 × 8
   country  iso_code date       vaccines source_name source_website metric value
   <chr>    <chr>    <date>     <chr>    <chr>       <chr>          <chr>  <dbl>
 1 Afghani… AFG      2021-02-22 Johnson… World Heal… https://covid… total…     0
 2 Afghani… AFG      2021-02-22 Johnson… World Heal… https://covid… peopl…     0
 3 Afghani… AFG      2021-02-22 Johnson… World Heal… https://covid… total…     0
 4 Afghani… AFG      2021-02-22 Johnson… World Heal… https://covid… peopl…     0
 5 Afghani… AFG      2021-02-23 Johnson… World Heal… https://covid… daily…  1367
 6 Afghani… AFG      2021-02-23 Johnson… World Heal… https://covid… daily…    34
 7 Afghani… AFG      2021-02-24 Johnson… World Heal… https://covid… daily…  1367
 8 Afghani… AFG      2021-02-24 Johnson… World Heal… https://covid… daily…    34
 9 Afghani… AFG      2021-02-25 Johnson… World Heal… https://covid… daily…  1367
10 Afghani… AFG      2021-02-25 Johnson… World Heal… https://covid… daily…    34

Missing and Inconsistent Value Handling

For this dataset, missing values are common because countries report different vaccination measures on different dates. Rather than imputing values, I remove only rows where the reshaped value is missing after pivoting. This preserves reported observations while avoiding unsupported assumptions.

covid_tidy %>%
  summarise(
    rows_after_tidying = n(),
    countries = n_distinct(country),
    metrics = n_distinct(metric),
    date_min = min(date),
    date_max = max(date)
  )
# A tibble: 1 × 5
  rows_after_tidying countries metrics date_min   date_max  
               <int>     <int>   <int> <date>     <date>    
1             455194       223       9 2020-12-02 2022-03-29

Analytical Method

The original discussion focus was vaccination rates across different countries. To address that question using the tidy dataset, I first identify the latest available value of people fully vaccinated per hundred for each country. I then compare selected countries over time using a line chart to show how vaccination progress evolved during the reporting period.

Analysis

covid_latest_full <- covid_tidy %>%
  filter(metric == "people fully vaccinated per hundred") %>%
  group_by(country) %>%
  filter(date == max(date, na.rm = TRUE)) %>%
  summarise(
    latest_date = max(date),
    fully_vaccinated_per_hundred = max(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(fully_vaccinated_per_hundred)) %>%
  arrange(desc(fully_vaccinated_per_hundred))

covid_latest_full %>% slice_head(n = 15)
# A tibble: 15 × 3
   country              latest_date fully_vaccinated_per_hundred
   <chr>                <date>                             <dbl>
 1 Gibraltar            2022-03-24                         122. 
 2 Pitcairn             2021-09-07                         100  
 3 United Arab Emirates 2022-03-29                          96.1
 4 Portugal             2022-03-10                          92.6
 5 Brunei               2022-03-18                          91.7
 6 Singapore            2022-03-28                          91.0
 7 Malta                2022-03-28                          90.4
 8 Chile                2022-03-22                          90.3
 9 Qatar                2022-03-24                          88.5
10 Cayman Islands       2022-03-25                          87.8
11 Niue                 2022-03-13                          87.8
12 Cuba                 2022-03-27                          87.6
13 South Korea          2022-03-29                          86.7
14 China                2022-03-24                          85.9
15 Spain                2022-03-23                          85.8

The table above lists the countries with the highest latest reported values of people fully vaccinated per hundred. This provides a direct comparison of vaccination coverage across countries using the most recent available data.

The following bar chart visualizes the top 10 countries by latest fully vaccinated rate.

covid_latest_full %>%
  slice_head(n = 10) %>%
  ggplot(aes(
    x = reorder(country, fully_vaccinated_per_hundred),
    y = fully_vaccinated_per_hundred
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Countries by Latest Fully Vaccinated per 100 People",
    x = "Country",
    y = "People fully vaccinated per 100"
  ) +
  scale_y_continuous(labels = label_number())

This plot shows that vaccination coverage varied substantially across countries. Countries such as Gibraltar, Portugal, Singapore, and the United Arab Emirates appear among the highest values in the dataset. Some values approach or exceed 100 per hundred people, which may reflect reporting conventions, booster campaigns, or additional dose coverage in the source data.

To examine vaccination progress over time, I also compare a small set of countries using the tidy dataset.

selected_countries <- c("United States", "India", "United Kingdom", "Germany", "Brazil")

covid_time_series <- covid_tidy %>%
  filter(
    metric == "people fully vaccinated per hundred",
    country %in% selected_countries
  )

covid_time_series %>%
  ggplot(aes(x = date, y = value, color = country)) +
  geom_line(linewidth = 0.8) +
  labs(
    title = "Vaccination Progress Over Time",
    subtitle = "People fully vaccinated per 100 people",
    x = "Date",
    y = "People fully vaccinated per 100",
    color = "Country"
  ) +
  scale_y_continuous(labels = label_number())

This line chart shows how vaccination campaigns progressed over time for the selected countries. The slope of each line indicates the pace of vaccination rollout. Countries such as the United Kingdom and Germany increased quickly during the early phases, while India and Brazil showed slower initial progress before later increases.

Interpretation

Using the tidy dataset, the analysis shows that vaccination coverage differs substantially across countries. The ranking of the latest fully vaccinated rates highlights which countries achieved broader coverage, while the time-series plot shows how vaccination campaigns progressed at different speeds across selected countries. Together, these outputs demonstrate how tidying the dataset makes cross-country comparison much easier.

Dataset 2 — Renewable Energy Capacity Time Series

Data Source

Source: Kaggle, Renewable Power Plants / Renewable Capacity Time Series. Raw CSV used in this project:

renewable_url <- "https://raw.githubusercontent.com/suffyankhan77/Project2-DATA-607/refs/heads/main/renewable_capacity_timeseries.csv"

Data Structure Before Tidying

renewable_raw <- read_csv(renewable_url)

dim(renewable_raw)
[1] 44166    40
names(renewable_raw)
 [1] "day"                           "CH_bioenergy_capacity"        
 [3] "CH_solar_capacity"             "CH_wind_onshore_capacity"     
 [5] "DE_bioenergy_capacity"         "DE_geothermal_capacity"       
 [7] "DE_solar_capacity"             "DE_wind_capacity"             
 [9] "DE_wind_offshore_capacity"     "DE_wind_onshore_capacity"     
[11] "DK_solar_capacity"             "DK_wind_capacity"             
[13] "DK_wind_offshore_capacity"     "DK_wind_onshore_capacity"     
[15] "FR_bioenergy_capacity"         "FR_geothermal_capacity"       
[17] "FR_hydro_capacity"             "FR_marine_capacity"           
[19] "FR_solar_capacity"             "FR_wind_onshore_capacity"     
[21] "GB-GBN_bioenergy_capacity"     "GB-GBN_hydro_capacity"        
[23] "GB-GBN_marine_capacity"        "GB-GBN_solar_capacity"        
[25] "GB-GBN_wind_capacity"          "GB-GBN_wind_offshore_capacity"
[27] "GB-GBN_wind_onshore_capacity"  "GB-NIR_bioenergy_capacity"    
[29] "GB-NIR_solar_capacity"         "GB-NIR_wind_onshore_capacity" 
[31] "GB-UKM_bioenergy_capacity"     "GB-UKM_hydro_capacity"        
[33] "GB-UKM_marine_capacity"        "GB-UKM_solar_capacity"        
[35] "GB-UKM_wind_capacity"          "GB-UKM_wind_offshore_capacity"
[37] "GB-UKM_wind_onshore_capacity"  "SE_wind_capacity"             
[39] "SE_wind_offshore_capacity"     "SE_wind_onshore_capacity"     
glimpse(renewable_raw)
Rows: 44,166
Columns: 40
$ day                             <date> 1900-01-01, 1900-01-02, 1900-01-03, 1…
$ CH_bioenergy_capacity           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ CH_solar_capacity               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ CH_wind_onshore_capacity        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DE_bioenergy_capacity           <dbl> 0.00, 0.08, 0.08, 0.08, 0.08, 0.08, 0.…
$ DE_geothermal_capacity          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DE_solar_capacity               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DE_wind_capacity                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DE_wind_offshore_capacity       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DE_wind_onshore_capacity        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DK_solar_capacity               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DK_wind_capacity                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DK_wind_offshore_capacity       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DK_wind_onshore_capacity        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FR_bioenergy_capacity           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FR_geothermal_capacity          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FR_hydro_capacity               <dbl> 122.496, 122.496, 122.496, 122.496, 12…
$ FR_marine_capacity              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FR_solar_capacity               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FR_wind_onshore_capacity        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_bioenergy_capacity`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_hydro_capacity`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_marine_capacity`        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_solar_capacity`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_wind_capacity`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_wind_offshore_capacity` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-GBN_wind_onshore_capacity`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-NIR_bioenergy_capacity`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-NIR_solar_capacity`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-NIR_wind_onshore_capacity`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_bioenergy_capacity`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_hydro_capacity`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_marine_capacity`        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_solar_capacity`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_wind_capacity`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_wind_offshore_capacity` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GB-UKM_wind_onshore_capacity`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ SE_wind_capacity                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ SE_wind_offshore_capacity       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ SE_wind_onshore_capacity        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
renewable_raw %>% slice_head(n = 10)
# A tibble: 10 × 40
   day        CH_bioenergy_capacity CH_solar_capacity CH_wind_onshore_capacity
   <date>                     <dbl>             <dbl>                    <dbl>
 1 1900-01-01                     0                 0                        0
 2 1900-01-02                     0                 0                        0
 3 1900-01-03                     0                 0                        0
 4 1900-01-04                     0                 0                        0
 5 1900-01-05                     0                 0                        0
 6 1900-01-06                     0                 0                        0
 7 1900-01-07                     0                 0                        0
 8 1900-01-08                     0                 0                        0
 9 1900-01-09                     0                 0                        0
10 1900-01-10                     0                 0                        0
# ℹ 36 more variables: DE_bioenergy_capacity <dbl>,
#   DE_geothermal_capacity <dbl>, DE_solar_capacity <dbl>,
#   DE_wind_capacity <dbl>, DE_wind_offshore_capacity <dbl>,
#   DE_wind_onshore_capacity <dbl>, DK_solar_capacity <dbl>,
#   DK_wind_capacity <dbl>, DK_wind_offshore_capacity <dbl>,
#   DK_wind_onshore_capacity <dbl>, FR_bioenergy_capacity <dbl>,
#   FR_geothermal_capacity <dbl>, FR_hydro_capacity <dbl>, …

The original dataset is wide because the renewable generation variables are stored as separate columns such as CH_solar_capacity, DE_wind_capacity, and FR_hydro_capacity. In this structure, both the country and the renewable energy type are embedded in the column names rather than represented as separate variables.

Data Import and Tidying

renewable_tidy <- renewable_raw %>%
  mutate(day = as.Date(day)) %>%
  pivot_longer(
    cols = -day,
    names_to = "country_energy",
    values_to = "capacity"
  ) %>%
  extract(
    country_energy,
    into = c("country_code", "energy_type"),
    regex = "^(.*?)_(.*)_capacity$",
    remove = TRUE
  ) %>%
  rename_with(tolower) %>%
  mutate(
    energy_type = str_replace_all(energy_type, "_", " "),
    country_code = str_to_upper(country_code)
  ) %>%
  filter(!is.na(capacity))

glimpse(renewable_tidy)
Rows: 1,722,474
Columns: 4
$ day          <date> 1900-01-01, 1900-01-01, 1900-01-01, 1900-01-01, 1900-01-…
$ country_code <chr> "CH", "CH", "CH", "DE", "DE", "DE", "DE", "DE", "DE", "DK…
$ energy_type  <chr> "bioenergy", "solar", "wind onshore", "bioenergy", "geoth…
$ capacity     <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
renewable_tidy %>% slice_head(n = 10)
# A tibble: 10 × 4
   day        country_code energy_type   capacity
   <date>     <chr>        <chr>            <dbl>
 1 1900-01-01 CH           bioenergy            0
 2 1900-01-01 CH           solar                0
 3 1900-01-01 CH           wind onshore         0
 4 1900-01-01 DE           bioenergy            0
 5 1900-01-01 DE           geothermal           0
 6 1900-01-01 DE           solar                0
 7 1900-01-01 DE           wind                 0
 8 1900-01-01 DE           wind offshore        0
 9 1900-01-01 DE           wind onshore         0
10 1900-01-01 DK           solar                0

Missing and Inconsistent Value Handling

In this dataset, the main inconsistency is that country and energy type are encoded together in the original column names. I normalize this structure by extracting them into separate variables. Missing capacity values are removed after reshaping.

The source dataset begins at 1900-01-01, and I retain the date values as provided in the original file so that the transformation pipeline remains fully reproducible.

renewable_tidy %>%
  summarise(
    rows_after_tidying = n(),
    countries = n_distinct(country_code),
    energy_types = n_distinct(energy_type),
    start_date = min(day),
    end_date = max(day)
  )
# A tibble: 1 × 5
  rows_after_tidying countries energy_types start_date end_date  
               <int>     <int>        <int> <date>     <date>    
1            1722474         8            8 1900-01-01 2020-12-02

Analytical Method

The original discussion suggested comparing renewable energy output across energy sources, across countries, and over time. To address this, I summarize total capacity by energy type, total capacity by country, and then aggregate daily totals by energy type to examine long-run time trends.

Analysis

renewable_by_energy <- renewable_tidy %>%
  group_by(energy_type) %>%
  summarise(
    total_capacity = sum(capacity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_capacity))

renewable_by_energy
# A tibble: 8 × 2
  energy_type   total_capacity
  <chr>                  <dbl>
1 wind              327252661.
2 wind onshore      318761019.
3 solar             210255953.
4 hydro             129437107.
5 bioenergy          82058286.
6 wind offshore      54058820.
7 marine               222916.
8 geothermal           125056.

The summary table above compares total renewable capacity across energy types.

renewable_by_energy %>%
  ggplot(aes(
    x = reorder(energy_type, total_capacity),
    y = total_capacity
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Total Renewable Capacity by Energy Type",
    x = "Energy type",
    y = "Total capacity"
  ) +
  scale_y_continuous(labels = label_comma())

This bar chart shows that wind related categories contribute the largest total capacity in the dataset. Solar and hydro also make large contributions, while geothermal and marine remain much smaller. This suggests that wind and solar are the dominant renewable technologies across the included countries.

Next, I compare total renewable capacity across countries.

renewable_by_country <- renewable_tidy %>%
  group_by(country_code) %>%
  summarise(
    total_capacity = sum(capacity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_capacity))

renewable_by_country
# A tibble: 8 × 2
  country_code total_capacity
  <chr>                 <dbl>
1 DE               507154526.
2 GB-UKM           214517654.
3 GB-GBN           207324910.
4 FR                82867640.
5 DK                59064658.
6 SE                45247120.
7 GB-NIR             3777263.
8 CH                 2218045.

The following plot compares the total renewable capacity of each country or region in the dataset.

renewable_by_country %>%
  ggplot(aes(
    x = reorder(country_code, total_capacity),
    y = total_capacity
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Total Renewable Capacity by Country Code",
    x = "Country code",
    y = "Total capacity"
  ) +
  scale_y_continuous(labels = label_comma())

This chart shows that Germany has the highest total renewable capacity in the dataset, followed by major United Kingdom regions. Countries or regions such as Switzerland and Northern Ireland contribute much smaller totals. These differences likely reflect varying levels of investment, infrastructure, and national renewable energy policy.

Finally, I examine how major renewable energy categories evolve over time.

renewable_over_time <- renewable_tidy %>%
  group_by(day, energy_type) %>%
  summarise(
    total_capacity = sum(capacity, na.rm = TRUE),
    .groups = "drop"
  )

renewable_over_time %>%
  ggplot(aes(x = day, y = total_capacity, color = energy_type)) +
  geom_line(linewidth = 0.7) +
  labs(
    title = "Renewable Capacity Over Time by Energy Type",
    x = "Date",
    y = "Total capacity",
    color = "Energy type"
  ) +
  scale_y_continuous(labels = label_comma())

This time-series plot shows how renewable capacity changes over time for each energy type. Wind and solar show strong upward growth patterns, which suggests expanded adoption of these technologies over time. Hydropower remains more stable, likely because major hydro infrastructure changes more slowly once built.

Interpretation

After tidying, this dataset makes it possible to compare renewable capacity across both countries and energy sources. The summary tables and plots show which renewable technologies contribute the most total capacity, which countries have the largest totals, and how the overall capacity profile changes over time. The tidy structure makes all of these comparisons much more direct than in the original wide format.

Dataset 3 — GDP by Country (1960–2022)

Data Source

Source: Kaggle, World GDP by Country (1960–2022). Raw CSV used in this project:

gdp_url <- "https://raw.githubusercontent.com/suffyankhan77/Project2-DATA-607/refs/heads/main/GDP.csv"

Data Structure Before Tidying

gdp_raw <- read_csv(gdp_url)

dim(gdp_raw)
[1] 266  65
names(gdp_raw)
 [1] "Country"      "Country Code" "1960"         "1961"         "1962"        
 [6] "1963"         "1964"         "1965"         "1966"         "1967"        
[11] "1968"         "1969"         "1970"         "1971"         "1972"        
[16] "1973"         "1974"         "1975"         "1976"         "1977"        
[21] "1978"         "1979"         "1980"         "1981"         "1982"        
[26] "1983"         "1984"         "1985"         "1986"         "1987"        
[31] "1988"         "1989"         "1990"         "1991"         "1992"        
[36] "1993"         "1994"         "1995"         "1996"         "1997"        
[41] "1998"         "1999"         "2000"         "2001"         "2002"        
[46] "2003"         "2004"         "2005"         "2006"         "2007"        
[51] "2008"         "2009"         "2010"         "2011"         "2012"        
[56] "2013"         "2014"         "2015"         "2016"         "2017"        
[61] "2018"         "2019"         "2020"         "2021"         "2022"        
glimpse(gdp_raw)
Rows: 266
Columns: 65
$ Country        <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ `Country Code` <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ `1960`         <dbl> NA, 21125015452, 537777811, 10447637853, NA, NA, NA, NA…
$ `1961`         <dbl> NA, 21616228139, 548888896, 11173212080, NA, NA, NA, NA…
$ `1962`         <dbl> NA, 23506279900, 546666678, 11990534018, NA, NA, NA, NA…
$ `1963`         <dbl> NA, 28048360188, 751111191, 12727688165, NA, NA, NA, NA…
$ `1964`         <dbl> NA, 25920665260, 800000044, 13898109284, NA, NA, NA, NA…
$ `1965`         <dbl> NA, 29472103270, 1006666638, 14929792388, NA, NA, NA, N…
$ `1966`         <dbl> NA, 32014368121, 1399999967, 15910837742, NA, NA, NA, N…
$ `1967`         <dbl> NA, 33269509510, 1673333418, 14510579889, NA, NA, NA, N…
$ `1968`         <dbl> NA, 36327785495, 1373333367, 14968235782, NA, NA, NA, 3…
$ `1969`         <dbl> NA, 41638967621, 1408888922, 16979315745, NA, NA, NA, 3…
$ `1970`         <dbl> NA, 44629891649, 1748886596, 23596163865, NA, NA, 78617…
$ `1971`         <dbl> NA, 49173371529, 1831108971, 20936358634, NA, NA, 89406…
$ `1972`         <dbl> NA, 53123459912, 1595555476, 25386169423, NA, NA, 11341…
$ `1973`         <dbl> NA, 69482723444, 1733333264, 31975594565, NA, NA, 15084…
$ `1974`         <dbl> NA, 85380645042, 2155555498, 44416677335, NA, NA, 18655…
$ `1975`         <dbl> NA, 90835426418, 2366666616, 51667190242, NA, NA, 22011…
$ `1976`         <dbl> NA, 90212747243, 2555555567, 62351622300, NA, NA, 22728…
$ `1977`         <dbl> NA, 102240575583, 2953333418, 65595122956, NA, NA, 2539…
$ `1978`         <dbl> NA, 116084638702, 3300000109, 71496496574, NA, NA, 3080…
$ `1979`         <dbl> NA, 134256827127, 3697940410, 88948338390, NA, NA, 4115…
$ `1980`         <dbl> NA, 171217790781, 3641723322, 112439126385, 5930503401,…
$ `1981`         <dbl> NA, 175859256874, 3478787909, 211338060015, 5550483036,…
$ `1982`         <dbl> NA, 168095657215, NA, 187448724920, 5550483036, NA, 375…
$ `1983`         <dbl> NA, 175564912386, NA, 138384182007, 5784341596, NA, 327…
$ `1984`         <dbl> NA, 160646748724, NA, 114516348921, 6131475065, 1857338…
$ `1985`         <dbl> NA, 136759437910, NA, 116776995133, 7554065410, 1897050…
$ `1986`         <dbl> 405586592, 153050335916, NA, 107886511309, 7072536109, …
$ `1987`         <dbl> 487709497, 186658478814, NA, 110728825942, 8084412414, …
$ `1988`         <dbl> 596648045, 204765985926, NA, 109438851254, 8769836769, …
$ `1989`         <dbl> 695530726, 218241607366, NA, 102254998563, 10201780977,…
$ `1990`         <dbl> 764804469, 254062093242, NA, 122387353859, 11229515599,…
$ `1991`         <dbl> 872067039, 276856728336, NA, 118039698016, 12704558517,…
$ `1992`         <dbl> 958659218, 246088124936, NA, 118893094122, 15114352005,…
$ `1993`         <dbl> 1083240223, 242926405780, NA, 99272180411, 11051939102,…
$ `1994`         <dbl> 1245810056, 239610677917, NA, 86636400266, 3390500000, …
$ `1995`         <dbl> 1320670391, 270327154575, NA, 108690885030, 5561222222,…
$ `1996`         <dbl> 1379888268, 269490833465, NA, 126287285163, 7526963964,…
$ `1997`         <dbl> 1531843575, 283446224788, NA, 127602388366, 7648377413,…
$ `1998`         <dbl> 1665363128, 266652333831, NA, 130678128885, 6506229607,…
$ `1999`         <dbl> 1722905028, 263024788890, NA, 138085971820, 6152922943,…
$ `2000`         <dbl> 1873184358, 284759318603, NA, 140945759314, 9129594819,…
$ `2001`         <dbl> 1896648045, 259643121973, NA, 148529518712, 8936079253,…
$ `2002`         <dbl> 1962011173, 266529432166, 3854235264, 177201164643, 152…
$ `2003`         <dbl> 2044134078, 354176768091, 4539496563, 205214466071, 178…
$ `2004`         <dbl> 2.254749e+09, 4.404818e+11, 5.220825e+09, 2.542648e+11,…
$ `2005`         <dbl> 2.359777e+09, 5.139416e+11, 6.226199e+09, 3.108896e+11,…
$ `2006`         <dbl> 2.469832e+09, 5.775869e+11, 6.971383e+09, 3.969210e+11,…
$ `2007`         <dbl> 2.677654e+09, 6.628680e+11, 9.715765e+09, 4.654855e+11,…
$ `2008`         <dbl> 2.843017e+09, 7.105362e+11, 1.024977e+10, 5.677912e+11,…
$ `2009`         <dbl> 2.553631e+09, 7.219012e+11, 1.215484e+10, 5.083627e+11,…
$ `2010`         <dbl> 2.453631e+09, 8.635195e+11, 1.563384e+10, 5.985216e+11,…
$ `2011`         <dbl> 2.637989e+09, 9.678246e+11, 1.819041e+10, 6.820159e+11,…
$ `2012`         <dbl> 2.615084e+09, 9.753548e+11, 2.020357e+10, 7.375895e+11,…
$ `2013`         <dbl> 2.727933e+09, 9.859871e+11, 2.056449e+10, 8.339481e+11,…
$ `2014`         <dbl> 2.791061e+09, 1.006526e+12, 2.055058e+10, 8.943225e+11,…
$ `2015`         <dbl> 2.963128e+09, 9.273485e+11, 1.999814e+10, 7.686447e+11,…
$ `2016`         <dbl> 2.983799e+09, 8.851764e+11, 1.801955e+10, 6.913634e+11,…
$ `2017`         <dbl> 3.092179e+09, 1.021043e+12, 1.889635e+10, 6.848988e+11,…
$ `2018`         <dbl> 3.276188e+09, 1.007196e+12, 1.841886e+10, 7.670257e+11,…
$ `2019`         <dbl> 3.395794e+09, 1.000834e+12, 1.890450e+10, 8.225384e+11,…
$ `2020`         <dbl> 2.610039e+09, 9.275933e+11, 2.014345e+10, 7.864600e+11,…
$ `2021`         <dbl> 3.126019e+09, 1.081998e+12, 1.458314e+10, 8.444597e+11,…
$ `2022`         <dbl> NA, 1.169484e+12, NA, 8.778633e+11, 1.067136e+11, 1.888…
gdp_raw %>% slice_head(n = 10)
# A tibble: 10 × 65
   Country  `Country Code`   `1960`   `1961`   `1962`   `1963`   `1964`   `1965`
   <chr>    <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Aruba    ABW            NA       NA       NA       NA       NA       NA      
 2 Africa … AFE             2.11e10  2.16e10  2.35e10  2.80e10  2.59e10  2.95e10
 3 Afghani… AFG             5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8  1.01e 9
 4 Africa … AFW             1.04e10  1.12e10  1.20e10  1.27e10  1.39e10  1.49e10
 5 Angola   AGO            NA       NA       NA       NA       NA       NA      
 6 Albania  ALB            NA       NA       NA       NA       NA       NA      
 7 Andorra  AND            NA       NA       NA       NA       NA       NA      
 8 Arab Wo… ARB            NA       NA       NA       NA       NA       NA      
 9 United … ARE            NA       NA       NA       NA       NA       NA      
10 Argenti… ARG            NA       NA        2.45e10  1.83e10  2.56e10  2.83e10
# ℹ 57 more variables: `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
#   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …

The GDP dataset is wide because each year is stored as a separate column. This means the variable year is embedded in the column names instead of being represented as its own column.

Data Import and Tidying

gdp_tidy <- gdp_raw %>%
  pivot_longer(
    cols = -c(`Country`, `Country Code`),
    names_to = "year",
    values_to = "gdp"
  ) %>%
  rename(
    country = `Country`,
    country_code = `Country Code`
  ) %>%
  rename_with(tolower) %>%
  mutate(
    year = as.integer(year)
  ) %>%
  filter(!is.na(gdp))

glimpse(gdp_tidy)
Rows: 13,365
Columns: 4
$ country      <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "…
$ year         <int> 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 199…
$ gdp          <dbl> 405586592, 487709497, 596648045, 695530726, 764804469, 87…
gdp_tidy %>% slice_head(n = 10)
# A tibble: 10 × 4
   country country_code  year         gdp
   <chr>   <chr>        <int>       <dbl>
 1 Aruba   ABW           1986  405586592.
 2 Aruba   ABW           1987  487709497.
 3 Aruba   ABW           1988  596648045.
 4 Aruba   ABW           1989  695530726.
 5 Aruba   ABW           1990  764804469.
 6 Aruba   ABW           1991  872067039.
 7 Aruba   ABW           1992  958659218.
 8 Aruba   ABW           1993 1083240223 
 9 Aruba   ABW           1994 1245810056 
10 Aruba   ABW           1995 1320670391 

Missing and Inconsistent Value Handling

For this dataset, the main structural issue is that years are spread across columns. I normalize the dataset by reshaping those years into a single year variable. Missing GDP values are removed after reshaping.

gdp_tidy %>%
  summarise(
    rows_after_tidying = n(),
    countries = n_distinct(country),
    min_year = min(year),
    max_year = max(year)
  )
# A tibble: 1 × 4
  rows_after_tidying countries min_year max_year
               <int>     <int>    <int>    <int>
1              13365       262     1960     2022

Analytical Method

The original discussion proposed comparing GDP growth between countries and visualizing economic trends. To address this, I select five major economies, visualize GDP trends over time, compare their average GDP values across the full period, and calculate a simple long run growth multiple using the first and last available GDP observations.

Analysis

gdp_selected <- gdp_tidy %>%
  filter(country %in% c("United States", "China", "Japan", "Germany", "India"))

gdp_selected %>%
  ggplot(aes(x = year, y = gdp, color = country)) +
  geom_line(linewidth = 0.8) +
  labs(
    title = "GDP Trends for Selected Countries",
    x = "Year",
    y = "GDP (current US$)",
    color = "Country"
  ) +
  scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale()))

This line chart compares GDP trajectories for five major economies. The United States remains the largest economy across the full period, while China shows especially rapid growth beginning in the later decades. India also shows strong upward growth, although from a lower starting point. The chart highlights both differences in scale and differences in long run growth patterns.

Next, I calculate average GDP across the full period for selected countries.

gdp_average <- gdp_selected %>%
  group_by(country) %>%
  summarise(
    average_gdp = mean(gdp, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(average_gdp))

gdp_average
# A tibble: 5 × 2
  country       average_gdp
  <chr>               <dbl>
1 United States     8.21e12
2 China             3.06e12
3 Japan             2.90e12
4 Germany           2.13e12
5 India             7.41e11

The summary table above shows the average GDP of each selected country across the available years.

gdp_average %>%
  ggplot(aes(
    x = reorder(country, average_gdp),
    y = average_gdp
  )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average GDP for Selected Countries",
    x = "Country",
    y = "Average GDP (current US$)"
  ) +
  scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale()))

This bar chart shows that the United States has the highest average GDP across the full period, followed by China and Japan. India has the lowest average among the selected countries, which reflects the fact that its strongest economic expansion occurred later in the series.

To evaluate growth over time, I compare the first and last available GDP values for selected countries.

gdp_growth <- gdp_selected %>%
  group_by(country) %>%
  summarise(
    first_year = min(year),
    last_year = max(year),
    first_gdp = gdp[year == first_year][1],
    last_gdp = gdp[year == last_year][1],
    growth_multiple = last_gdp / first_gdp,
    .groups = "drop"
  ) %>%
  arrange(desc(growth_multiple))

gdp_growth
# A tibble: 5 × 6
  country       first_year last_year    first_gdp last_gdp growth_multiple
  <chr>              <int>     <int>        <dbl>    <dbl>           <dbl>
1 China               1960      2022  59716251765  1.80e13           301. 
2 Japan               1960      2022  44307342906  4.23e12            95.5
3 India               1960      2022  37029883847  3.39e12            91.4
4 United States       1960      2022 543300000000  2.55e13            46.9
5 Germany             1970      2022 215838448138  4.07e12            18.9

The table above compares long-run GDP growth using a simple growth multiple from the first available year to the last available year for each selected country. This helps identify which countries experienced the greatest proportional increase over time.

Interpretation

The tidy GDP dataset makes it straightforward to compare long-run economic trends across countries. The time-series graph shows differences in scale and growth trajectory, while the summary tables and bar chart highlight both average GDP size and long term growth multiples. This analysis demonstrates how reshaping the original wide dataset into tidy form supports clear economic comparison.

Conclusion

This project transformed three real-world datasets from wide format into tidy format using reproducible tidyr and dplyr workflows.

For the COVID-19 vaccination dataset, multiple vaccination measures were normalized into metric and value columns, making it possible to compare vaccination rates across countries and over time. For the renewable energy dataset, country and energy-type information embedded in column names were separated into explicit variables, allowing comparisons across technologies and countries. For the GDP dataset, yearly GDP columns were reshaped into a single year variable and a corresponding gdp value column, enabling cross-country economic trend analysis.

Across all three cases, the tidy structure made the datasets easier to summarize, visualize, and interpret. This project demonstrates how reshaping wide data into tidy form is an essential step for reliable downstream analysis.