Project 2 – Data Transformations: Approach
Introduction
The objective of this project is to develop practical experience transforming wide-format datasets into tidy datasets suitable for analysis. In many real-world datasets, variables are spread across multiple columns, which makes analysis and visualization difficult. The tidy data principles described by Hadley Wickham recommend structuring datasets so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Using the tidyr and dplyr packages in R, this project demonstrates how wide datasets can be transformed into tidy formats through reproducible data transformation pipelines. Each dataset is first preserved in its raw wide format and then converted into a tidy structure that can be used for analysis and visualization.
Three independent datasets from Discussion 5A are used in this project:
- COVID-19 World Vaccination Progress
- Renewable Energy Capacity Time Series
- World GDP by Country (1960–2022)
Each dataset illustrates a different type of transformation scenario and allows meaningful analysis once the data is properly structured.
Dataset 1: COVID-19 World Vaccination Progress
Data Source
- Dataset: COVID World Vaccination Progress
- Source: Kaggle
- Link: https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress
This dataset contains vaccination statistics for countries around the world during the COVID-19 pandemic. The dataset includes multiple vaccination metrics such as total vaccinations, people vaccinated, people fully vaccinated, and daily vaccination rates.
Structure Before Tidying
The dataset includes several vaccination-related variables stored as separate columns. While the dataset contains useful information, the structure spreads similar measurements across multiple columns instead of representing them as a single variable describing the type of vaccination metric.
This structure is considered partially wide: the metric columns all record vaccination measurements for the same country and date, so the name of the metric is itself a variable that is currently stored in the column headers.
Planned Transformation
The transformation process will reshape the dataset by converting the vaccination metric columns into a single categorical variable describing the metric type, while storing the corresponding values in another column.
This will be achieved using pivot_longer() from the tidyr package.
The transformation will result in a tidy dataset with variables such as:
- country
- date
- metric
- value
This structure allows easier filtering, grouping, and comparison across vaccination metrics.
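The reshaping described above can be sketched as follows. This is a minimal illustration on a made-up two-row table, not the project's final code: the metric column names are assumed from the metrics listed in the dataset description, and the values are invented.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw data. Column names are assumptions based on the
# metrics named in the dataset description; all values are invented.
vacc_wide <- tibble(
  country                 = c("A", "B"),
  date                    = as.Date(c("2021-03-01", "2021-03-01")),
  total_vaccinations      = c(1000, 500),
  people_vaccinated       = c(800, 450),
  people_fully_vaccinated = c(200, 50)
)

# Collapse every metric column into a metric/value pair,
# keeping country and date as identifying columns.
vacc_tidy <- vacc_wide %>%
  pivot_longer(
    cols      = -c(country, date),
    names_to  = "metric",
    values_to = "value"
  )

print(vacc_tidy)
```

Selecting the columns by exclusion (`-c(country, date)`) is one possible choice; if the real file contains additional identifier columns, they would need to be excluded as well.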
Planned Analysis
After tidying the dataset, the analysis will examine vaccination progress across different countries. Visualizations will be created to show how vaccination rates change over time and how countries compare in terms of vaccination coverage.
Potential analysis includes:
- Comparing vaccination rates between countries
- Visualizing vaccination progress over time
- Identifying countries with higher vaccination coverage
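As a rough sketch of how these analyses might look once the data is tidy, the example below filters one metric, builds a line plot over time, and ranks countries by their most recent value. The tidy table is a small invented stand-in, and the metric name `people_fully_vaccinated` is an assumption about how the column is labelled in the raw file.

```r
library(dplyr)
library(ggplot2)

# Invented tidy data in the planned country/date/metric/value layout.
vacc_tidy <- tibble(
  country = rep(c("A", "B"), each = 4),
  date    = rep(as.Date(c("2021-03-01", "2021-04-01")), times = 4),
  metric  = rep(rep(c("people_vaccinated", "people_fully_vaccinated"), each = 2), times = 2),
  value   = c(800, 1500, 200, 900, 450, 700, 50, 300)
)

fully <- vacc_tidy %>%
  filter(metric == "people_fully_vaccinated")

# Vaccination progress over time, one line per country.
progress_plot <- ggplot(fully, aes(x = date, y = value, colour = country)) +
  geom_line() +
  labs(x = "Date", y = "People fully vaccinated", colour = "Country")

# Most recent observation per country, ranked from highest to lowest.
latest <- fully %>%
  group_by(country) %>%
  slice_max(date, n = 1) %>%
  ungroup() %>%
  arrange(desc(value))
```

Because the metric is a value rather than a column name, switching the comparison to another metric only requires changing the `filter()` condition.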
Dataset 2: Renewable Energy Capacity Time Series
Data Source
- Dataset: Renewable Power Plants / Renewable Capacity Time Series
- Source: Kaggle
- Link: https://www.kaggle.com/datasets/eugeniyosetrov/renewable-power-plants
This dataset contains power generation data for multiple renewable energy technologies across several European countries. The data includes values for different energy sources such as solar, wind, hydro, and other renewable technologies.
Structure Before Tidying
The dataset is initially very wide. It includes columns representing different renewable energy sources as well as multiple columns representing different countries. The dataset also contains time series information representing energy generation across time.
Because these variables are stored across multiple columns rather than as values in a single variable column, the dataset is not in tidy format.
Planned Transformation
To tidy this dataset, the wide structure will be converted into a long format where:
- energy type becomes a categorical variable
- country becomes a variable
- generation values are stored in a single column
- time remains as an index variable
This transformation will allow the dataset to be analyzed more easily using grouping and visualization tools.
The main transformation will again use pivot_longer() to reshape the data.
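When a single column header encodes two variables at once, `pivot_longer()` can split the header while reshaping. The sketch below assumes hypothetical column names of the form `<country>_<energy type>` (for example `DE_solar`); the real file may use a different naming pattern, in which case `names_sep` would need to be replaced by a suitable `names_pattern`. All values are invented.

```r
library(dplyr)
library(tidyr)

# Toy wide table. The "<country>_<energy type>" header pattern is an
# assumption for illustration, not the confirmed layout of the raw file.
energy_wide <- tibble(
  date     = as.Date(c("2020-01-01", "2020-01-02")),
  DE_solar = c(10, 12),
  DE_wind  = c(30, 28),
  FR_solar = c(5, 6),
  FR_wind  = c(8, 9)
)

# Split each header into country and energy_type, and stack the values.
energy_tidy <- energy_wide %>%
  pivot_longer(
    cols      = -date,
    names_to  = c("country", "energy_type"),
    names_sep = "_",
    values_to = "generation"
  )

print(energy_tidy)
```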
Planned Analysis
After tidying the data, the following analyses will be performed:
- Comparing renewable energy production across energy sources
- Comparing renewable energy output between countries
- Examining how renewable energy generation changes over time
These analyses will help illustrate differences in renewable energy usage and trends across countries and technologies.
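The comparisons listed above reduce to grouped summaries once the data is long. The following sketch uses an invented tidy table with the planned `date`, `country`, `energy_type`, and `generation` columns; totals are used here only as one possible summary.

```r
library(dplyr)

# Invented tidy data in the planned layout.
energy_tidy <- tibble(
  date        = rep(as.Date(c("2020-01-01", "2020-01-02")), each = 4),
  country     = rep(c("DE", "DE", "FR", "FR"), times = 2),
  energy_type = rep(c("solar", "wind"), times = 4),
  generation  = c(10, 30, 5, 8, 12, 28, 6, 9)
)

# Total by energy source.
by_source <- energy_tidy %>%
  group_by(energy_type) %>%
  summarise(total_generation = sum(generation), .groups = "drop")

# Total by country.
by_country <- energy_tidy %>%
  group_by(country) %>%
  summarise(total_generation = sum(generation), .groups = "drop")

# Total per date, for looking at change over time.
by_date <- energy_tidy %>%
  group_by(date) %>%
  summarise(total_generation = sum(generation), .groups = "drop")
```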
Dataset 3: World GDP by Country (1960–2022)
Data Source
- Dataset: World GDP by Country (1960–2022)
- Source: Kaggle
- Link: https://www.kaggle.com/datasets/annafabris/world-gdp-by-country-1960-2022
This dataset contains the gross domestic product (GDP) values for many countries across multiple years. Each country has GDP values listed for many different years within the same row.
Structure Before Tidying
The dataset stores each year as a separate column, which results in a wide structure. Instead of having a variable representing the year, the years appear as column names.
For example:
| Country | 1960 | 1961 | 1962 | 1963 | … |
|---|---|---|---|---|---|
| USA | value | value | value | value | … |
| France | value | value | value | value | … |
| Japan | value | value | value | value | … |
This format violates tidy data principles because the year variable is encoded as column headers rather than as a variable within the dataset.
Planned Transformation
The dataset will be transformed into tidy format by converting the year columns into a single year variable and placing GDP values into a corresponding gdp column.
This will be accomplished using pivot_longer().
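The resulting tidy structure will resemble:
| country | year | gdp |
|---|---|---|
| USA | 1960 | value |
| USA | 1961 | value |
| France | 1960 | value |
| … | … | … |
This format allows the dataset to be easily analyzed and visualized using standard data analysis tools.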
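A minimal sketch of this reshaping is shown below, using a toy table that mirrors the wide layout illustrated earlier. The GDP values are invented, and converting `year` to an integer with `names_transform` is one reasonable choice rather than a requirement.

```r
library(dplyr)
library(tidyr)

# Toy wide table with one column per year; values are invented.
gdp_wide <- tibble(
  Country = c("USA", "France"),
  `1960`  = c(100, 50),
  `1961`  = c(110, 52),
  `1962`  = c(121, 55)
)

# Move the year headers into a year column and the cells into gdp.
gdp_tidy <- gdp_wide %>%
  pivot_longer(
    cols            = -Country,
    names_to        = "year",
    names_transform = list(year = as.integer),
    values_to       = "gdp"
  ) %>%
  rename(country = Country)

print(gdp_tidy)
```

Storing `year` as a number rather than text makes it usable directly on a plot axis and in arithmetic such as year-over-year differences.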
Planned Analysis
Once the dataset is tidy, the analysis will focus on economic trends across countries.
Possible analyses include:
- Comparing GDP growth between countries
- Calculating average GDP over time
- Identifying countries with the fastest economic growth
- Visualizing GDP trends using time series plots
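One way to approach the growth comparisons is to compute year-over-year percentage change within each country and then average it. The sketch below uses invented values, and defining growth as the simple percentage change from the previous year is an assumption; other definitions, such as compound annual growth, are equally possible.

```r
library(dplyr)

# Invented tidy GDP data in the planned country/year/gdp layout.
gdp_tidy <- tibble(
  country = rep(c("USA", "France"), each = 3),
  year    = rep(1960:1962, times = 2),
  gdp     = c(100, 110, 121, 50, 52, 55)
)

# Year-over-year percentage growth within each country.
gdp_growth <- gdp_tidy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(growth_pct = 100 * (gdp / dplyr::lag(gdp) - 1)) %>%
  ungroup()

# Average growth per country, fastest first. The first year of each
# country has no previous value, so its NA is dropped from the mean.
avg_growth <- gdp_growth %>%
  group_by(country) %>%
  summarise(mean_growth_pct = mean(growth_pct, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_growth_pct))
```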
Reproducibility Plan
All data transformations and analyses will be implemented in a Quarto Markdown file using the tidyr, dplyr, and ggplot2 packages in R.
The workflow will follow these steps for each dataset:
- Import the raw wide-format dataset from a CSV file.
- Inspect the structure of the raw dataset.
- Apply pivot_longer() and other tidyr and dplyr transformations to create a tidy dataset.
- Perform analysis using the tidy dataset.
- Create summary tables and visualizations using ggplot2.
All code will be fully reproducible and executable from a clean R session.
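The workflow steps can be sketched end to end as follows. To keep the example runnable from a clean session, it writes a tiny invented CSV to a temporary file in place of a real download; the file contents, the object names, and the use of base `read.csv()` are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in for a raw wide-format CSV file (contents are invented).
raw_path <- tempfile(fileext = ".csv")
writeLines(c("Country,1960,1961", "USA,100,110", "France,50,52"), raw_path)

# 1. Import the raw data; check.names = FALSE keeps the year headers as-is.
raw <- read.csv(raw_path, check.names = FALSE)

# 2. Inspect the raw structure.
str(raw)

# 3. Tidy: one row per country and year.
tidy <- raw %>%
  pivot_longer(cols = -Country, names_to = "year", values_to = "gdp") %>%
  mutate(year = as.integer(year))

# 4. Analyze: a simple summary table.
summary_tbl <- tidy %>%
  group_by(Country) %>%
  summarise(mean_gdp = mean(gdp), .groups = "drop")

# 5. Visualize: build the plot object (print it to render).
trend_plot <- ggplot(tidy, aes(x = year, y = gdp, colour = Country)) +
  geom_line()
```

Keeping the raw object untouched and assigning each stage to a new name makes it straightforward to show the dataset before and after tidying.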
Expected Outcome
By completing these transformations, the project will demonstrate how wide-format datasets commonly encountered in real-world data sources can be reshaped into tidy structures suitable for analysis.
The resulting tidy datasets will allow more flexible analysis, easier visualization, and clearer interpretation of trends in vaccination progress, renewable energy generation, and economic growth across countries.
Conclusion
This project outlines an approach for transforming three real-world datasets from wide format into tidy format using the tidyr and dplyr packages in R. The selected datasets cover COVID-19 vaccination progress, renewable energy capacity time series, and GDP by country across multiple years. Each dataset has structural issues that make analysis difficult in its raw wide format.
By applying reproducible transformation pipelines built around pivot_longer(), these datasets will be converted into tidy structures that support efficient analysis and visualization. Once the datasets are properly structured, meaningful analyses can be performed to compare vaccination progress across countries, examine renewable energy generation patterns, and analyze GDP growth trends over time.
This approach ensures that all transformations and analyses will be reproducible, transparent, and aligned with tidy data principles.