Project 2 Data 607

Author

Ciara Bonnett

Published

March 5, 2026

Code
renewable_data <- tribble(
  ~Country, ~`1990`, ~`2000`, ~`2010`, ~`2020`,
  "United States", 3.4, 4.2, 6.1, 9.8,
  "Brazil", 45.2, 42.1, 47.5, 49.1,
  "Germany", 1.8, 3.5, 10.2, 17.5,
  "China", 25.3, 20.1, 13.2, 14.8,
  "India", 50.1, 45.2, 38.4, 32.1
)
write_csv(renewable_data, "renewable_energy.csv")
Code
takeout_data <- tribble(
 ~generation, ~jan_2025, ~feb_2025, ~mar_2025, ~apr_2025, ~may_2025, ~jun_2025,   "Gen Z", 150, 165, 145, 170, 160, 180,
  "Millennials", 210, 200, 225, 215, 230, 240, 
  "Gen X", 190, 185, 195, 190, 205, 210,
  "Baby Boomers", 120, 115, 130, 125, 110, 120, 
  "Silent Generation", 80, 75, 85, 80, 90, 85
)
write_csv(takeout_data, "takeout_spending.csv")
Code
soccer_data <- tribble(
  ~player, ~goals_home, ~goals_away, ~assists_home, ~assists_away,
  "Salah", 12, 7, 5, 4, 
  "Haaland", 15, 10, 2, 3, 
  "Saka", 8, 6, 6, 5, 
  "Son", 7, 9, 3, 4
)
write_csv(soccer_data, "soccer_stats.csv")

Introduction

In this project, I am transforming three distinct wide-format datasets into a tidy format to enable better downstream analysis.

Approach

First Dataset: I am using my post data set which is the World Bank Renewable Energy Consumption dataset. The wide problem for this dataset is its untidy becaise of the years being treated as the column headers. In a tidy format, “Year” should be a single variable in one column on its own. I will use read_csv() to pull the raw data from GitHub. Then, I will apply pivot_longer() to consolidate the year columns and mutate() to ensure the years are stored as numeric values.

Second Dataset: I am using Kiera Griffths’ post on generational spending on takeout. The wide problem here are the momths being used as column headers, which prevents us from easily calculating averages across the entire time period. I will use pivot_longer() to create Month column and a Spending column. I will use separate to split the month strings if needed for better sorting.

Third Dataset: I am using Pascal Hermann’s post about player performance metrics. This dataset is actually a complex wide format where column names contain two variables: the stat(Goals/Assists) and the venue (Home/Away). I will use pivot_longer() with the .value sentinel to split the column names at the underscore, creating a tidy version where Venue is its own column.

Challenges

For the first data I anticipate R reading the data like character strings. I have to be sure they convert to numeric format so that the trend displays as chronologically and treats the time between years correctly.

For the second one I see the months and years combining and alphabetically sorting it. I am using the separate() to create distinction and possibly using factor to ensure the months are in order how they are meant to be chronologically.

Lastly, the third dataset presents a multi-variable-per column problem. I am going to have to use a regex or specifics to split the values. I will also need to ensure that the data remains balanced so that ‘Salah’s’ home goals and home assists stay linked to the Home venue.