Project 2 – Data Transformations: Approach

Author

Muhammad Suffyan Khan

Published

March 5, 2026

Introduction

The objective of this project is to develop practical experience transforming wide-format datasets into tidy datasets suitable for analysis. In many real-world datasets, the values of a single variable are spread across multiple columns, which makes analysis and visualization difficult. The tidy data principles described by Hadley Wickham recommend structuring datasets so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Using the tidyr and dplyr packages in R, this project demonstrates how wide datasets can be transformed into tidy formats through reproducible data transformation pipelines. Each dataset is first preserved in its raw wide format and then converted into a tidy structure that can be used for analysis and visualization.

Three independent datasets from Discussion 5A are used in this project:

COVID-19 World Vaccination Progress
Renewable Energy Capacity Time Series
World GDP by Country (1960–2022)

Each dataset illustrates a different type of transformation scenario and allows meaningful analysis once the data is properly structured.

Dataset 1: COVID-19 World Vaccination Progress

Data Source

Dataset: COVID World Vaccination Progress
Source: Kaggle
Link: https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress

This dataset contains vaccination statistics for countries around the world during the COVID-19 pandemic. The dataset includes multiple vaccination metrics such as total vaccinations, people vaccinated, people fully vaccinated, and daily vaccination rates.

Structure Before Tidying

The dataset includes several vaccination-related variables stored as separate columns. While the dataset contains useful information, the structure spreads similar measurements across multiple columns instead of representing them as a single variable describing the type of vaccination metric.

This structure is considered partially wide because several columns hold the same kind of measurement (a vaccination count or rate) and differ only in which metric they record.

Planned Transformation

The transformation process will reshape the dataset by converting the vaccination metric columns into a single categorical variable describing the metric type, while storing the corresponding values in another column.

This will be achieved using pivot_longer() from the tidyr package.

The transformation will result in a tidy dataset with variables such as:

country
date
metric
value

This structure allows easier filtering, grouping, and comparison across vaccination metrics.
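The reshaping described above can be sketched as follows. This is a minimal illustration on a small made-up table, not the project's final code; the metric column names (total_vaccinations, people_vaccinated, people_fully_vaccinated) are assumed to match the raw file, and the numbers are placeholders.

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the raw data; column names and values are assumptions.
vacc_wide <- tibble(
  country = c("A", "A", "B"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-01")),
  total_vaccinations = c(100, 150, 40),
  people_vaccinated = c(90, 120, 40),
  people_fully_vaccinated = c(10, 30, NA)
)

# Collapse the metric columns into a metric/value pair.
# values_drop_na = TRUE removes rows where a metric was not reported.
vacc_tidy <- vacc_wide %>%
  pivot_longer(
    cols = c(total_vaccinations, people_vaccinated, people_fully_vaccinated),
    names_to = "metric",
    values_to = "value",
    values_drop_na = TRUE
  )

print(vacc_tidy)
```

Dropping missing values at this step is a design choice; keeping them (the default) may be preferable if gaps in reporting are themselves of interest.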

Planned Analysis

After tidying the dataset, the analysis will examine vaccination progress across different countries. Visualizations will be created to show how vaccination rates change over time and how countries compare in terms of vaccination coverage.

Potential analysis includes:

Comparing vaccination rates between countries
Visualizing vaccination progress over time
Identifying countries with higher vaccination coverage
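One way the coverage comparison might look once the data is tidy is sketched below. The tidy table here is a hand-made placeholder, and the metric name people_fully_vaccinated_per_hundred is an assumption about which column would be used for coverage.

```r
library(dplyr)

# Placeholder tidy data; values are illustrative only.
vacc_tidy <- tibble(
  country = c("A", "A", "B", "B"),
  date = as.Date(c("2021-01-01", "2021-02-01", "2021-01-01", "2021-02-01")),
  metric = "people_fully_vaccinated_per_hundred",
  value = c(1.0, 5.5, 2.0, 3.0)
)

# Keep the most recent observation per country, then rank by coverage.
latest_coverage <- vacc_tidy %>%
  filter(metric == "people_fully_vaccinated_per_hundred") %>%
  group_by(country) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(value))

print(latest_coverage)
```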

Dataset 2: Renewable Energy Capacity Time Series

Data Source

Dataset: Renewable Power Plants / Renewable Capacity Time Series
Source: Kaggle
Link: https://www.kaggle.com/datasets/eugeniyosetrov/renewable-power-plants

This dataset contains power generation data for multiple renewable energy technologies across several European countries. The data includes values for different energy sources such as solar, wind, hydro, and other renewable technologies.

Structure Before Tidying

The dataset is initially very wide. It includes columns representing different renewable energy sources as well as multiple columns representing different countries. The dataset also contains time series information representing energy generation across time.

Because these variables are stored across multiple columns rather than as values in a single variable column, the dataset is not in tidy format.

Planned Transformation

To tidy this dataset, the wide structure will be converted into a long format where:

energy type becomes a categorical variable
country becomes a variable
generation values are stored in a single column
time remains as an index variable

This transformation will allow the dataset to be analyzed more easily using grouping and visualization tools.

The main transformation will again use pivot_longer() to reshape the data.
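When a column name encodes two variables at once, pivot_longer() can split it while reshaping. The sketch below assumes hypothetical column names of the form COUNTRY_energytype (for example DE_solar) and a date column called day; the real file may use a different naming pattern, in which case names_sep or names_pattern would need adjusting.

```r
library(tidyr)
library(dplyr)

# Toy wide table; column names and values are assumptions for illustration.
energy_wide <- tibble(
  day = as.Date(c("2020-01-01", "2020-01-02")),
  DE_solar = c(10, 11),
  DE_wind = c(20, 21),
  FR_solar = c(5, 6)
)

# Split each column name at "_" into country and energy_type,
# and stack the values into a single column.
energy_tidy <- energy_wide %>%
  pivot_longer(
    cols = -day,
    names_to = c("country", "energy_type"),
    names_sep = "_",
    values_to = "value"
  )

print(energy_tidy)
```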

Planned Analysis

After tidying the data, the following analyses will be performed:

Comparing renewable energy production across energy sources
Comparing renewable energy output between countries
Examining how renewable energy generation changes over time

These analyses will help illustrate differences in renewable energy usage and trends across countries and technologies.

Dataset 3: World GDP by Country (1960–2022)

Data Source

Dataset: World GDP by Country (1960–2022)
Source: Kaggle
Link: https://www.kaggle.com/datasets/annafabris/world-gdp-by-country-1960-2022

This dataset contains the gross domestic product (GDP) values for many countries across multiple years. Each country has GDP values listed for many different years within the same row.

Structure Before Tidying

The dataset stores each year as a separate column, which results in a wide structure. Instead of having a variable representing the year, the years appear as column names.

For example:

Country | 1960 | 1961 | 1962 | 1963 | …
USA | value | value | value | value | …
France | value | value | value | value | …
Japan | value | value | value | value | …

This format violates tidy data principles because the year variable is encoded as column headers rather than as a variable within the dataset.

Planned Transformation

The dataset will be transformed into tidy format by converting the year columns into a single year variable and placing GDP values into a corresponding gdp column.

This will be accomplished using pivot_longer().

The resulting tidy structure will resemble:

country | year | gdp

This format allows the dataset to be easily analyzed and visualized using standard data analysis tools.
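A minimal sketch of this reshaping is shown below, using a two-country, two-year table with placeholder numbers. The header name Country is assumed from the example layout above. Because the year arrives as a column name, it is text by default; names_transform converts it to an integer so it can be used on a time axis.

```r
library(tidyr)
library(dplyr)

# Toy wide table; GDP values are placeholders, not real figures.
gdp_wide <- tibble(
  Country = c("USA", "France"),
  `1960` = c(1, 2),
  `1961` = c(3, 4)
)

# Turn the year headers into a year column and the cells into gdp.
gdp_tidy <- gdp_wide %>%
  pivot_longer(
    cols = -Country,
    names_to = "year",
    names_transform = list(year = as.integer),
    values_to = "gdp"
  ) %>%
  rename(country = Country)

print(gdp_tidy)
```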

Planned Analysis

Once the dataset is tidy, the analysis will focus on economic trends across countries.

Possible analyses include:

Comparing GDP growth between countries
Calculating average GDP over time
Identifying countries with the fastest economic growth
Visualizing GDP trends using time series plots
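As an illustration of how growth could be computed from the tidy table, the sketch below derives year-over-year percentage growth with lag() and then averages it per country. The input is a made-up table, and defining "fastest growth" as the mean annual growth rate is only one possible choice.

```r
library(dplyr)

# Placeholder tidy data; countries and values are illustrative only.
gdp_tidy <- tibble(
  country = c("X", "X", "X", "Y", "Y", "Y"),
  year = rep(2000:2002, 2),
  gdp = c(100, 110, 121, 200, 210, 231)
)

# Year-over-year growth within each country; the first year has no
# previous value, so its growth is NA.
gdp_growth <- gdp_tidy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(growth_pct = 100 * (gdp / lag(gdp) - 1)) %>%
  ungroup()

# Mean annual growth per country, fastest first.
avg_growth <- gdp_growth %>%
  group_by(country) %>%
  summarise(mean_growth_pct = mean(growth_pct, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_growth_pct))

print(avg_growth)
```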

Reproducibility Plan

All data transformations and analyses will be implemented in a Quarto Markdown file using the tidyr, dplyr, and ggplot2 packages in R.

The workflow will follow these steps for each dataset:

Import the raw wide-format dataset from a CSV file.
Inspect the structure of the raw dataset.
Apply pivot_longer() from tidyr, together with dplyr transformations, to create a tidy dataset.
Perform analysis using the tidy dataset.
Create summary tables and visualizations using ggplot2.
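The five steps can be sketched end to end as follows. To keep the example runnable from a clean session, it writes a tiny wide CSV to a temporary file in place of a real download; the file contents, the use of base read.csv(), and the column name Country are assumptions for illustration.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Stand-in for a raw file: a tiny wide CSV written to a temporary path.
raw_path <- tempfile(fileext = ".csv")
writeLines(c("Country,1960,1961", "X,1,2", "Y,3,5"), raw_path)

# 1. Import. check.names = FALSE keeps headers such as "1960" unchanged.
raw <- read.csv(raw_path, check.names = FALSE)

# 2. Inspect the raw structure.
str(raw)

# 3. Tidy: year headers become a year column.
tidy <- raw %>%
  pivot_longer(cols = -Country, names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(year))

# 4. Analyze: one summary row per country.
summary_tbl <- tidy %>%
  group_by(Country) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# 5. Visualize: one line per country over time.
p <- ggplot(tidy, aes(x = year, y = value, colour = Country)) +
  geom_line() +
  labs(x = "Year", y = "Value")

print(summary_tbl)
```

Keeping the raw object untouched and assigning the tidy result to a new name makes it easy to show both the before and after structure in the rendered document.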

All code will be fully reproducible and executable from a clean R session.

Expected Outcome

By completing these transformations, the project will demonstrate how wide-format datasets commonly encountered in real-world data sources can be reshaped into tidy structures suitable for analysis.

The resulting tidy datasets will allow more flexible analysis, easier visualization, and clearer interpretation of trends in vaccination progress, renewable energy generation, and economic growth across countries.

Conclusion

This project outlines an approach for transforming three real-world datasets from wide format into tidy format using the tidyr and dplyr packages in R. The selected datasets include COVID-19 vaccination progress, renewable energy capacity time series, and GDP by country across multiple years. Each dataset contains structural issues that make analysis difficult in its raw wide format.

By applying reproducible transformation pipelines built around pivot_longer(), these datasets will be converted into tidy structures that support efficient analysis and visualization. Once the datasets are properly structured, meaningful analyses can be performed to compare vaccination progress across countries, examine renewable energy generation patterns, and analyze GDP growth trends over time.

This approach ensures that all transformations and analyses will be reproducible, transparent, and aligned with tidy data principles.