Project 2 Data 607

Author

Ciara Bonnett

Published

March 8, 2026

Code

renewable_data <- tribble(
  ~Country, ~`1990`, ~`2000`, ~`2010`, ~`2020`,
  "United States", 3.4, 4.2, 6.1, 9.8,
  "Brazil", 45.2, 42.1, 47.5, 49.1,
  "Germany", 1.8, 3.5, 10.2, 17.5,
  "China", 25.3, 20.1, 13.2, 14.8,
  "India", 50.1, 45.2, 38.4, 32.1
)
write_csv(renewable_data, "renewable_energy.csv")

Code

takeout_data <- tribble(
 ~generation, ~jan_2025, ~feb_2025, ~mar_2025, ~apr_2025, ~may_2025, ~jun_2025,   "Gen Z", 150, 165, 145, 170, 160, 180,
  "Millennials", 210, 200, 225, 215, 230, 240, 
  "Gen X", 190, 185, 195, 190, 205, 210,
  "Baby Boomers", 120, 115, 130, 125, 110, 120, 
  "Silent Generation", 80, 75, 85, 80, 90, 85
)
write_csv(takeout_data, "takeout_spending.csv")

Code

soccer_data <- tribble(
  ~player, ~goals_home, ~goals_away, ~assists_home, ~assists_away,
  "Salah", 12, 7, 5, 4, 
  "Haaland", 15, 10, 2, 3, 
  "Saka", 8, 6, 6, 5, 
  "Son", 7, 9, 3, 4
)
write_csv(soccer_data, "soccer_stats.csv")

Introduction

In this project, I am transforming three distinct wide-format datasets into a tidy format to enable better downstream analysis.

Approach

First Dataset: I am using my post data set which is the World Bank Renewable Energy Consumption dataset. The wide problem for this dataset is its untidy becaise of the years being treated as the column headers. In a tidy format, “Year” should be a single variable in one column on its own. I will use read_csv() to pull the raw data from GitHub. Then, I will apply pivot_longer() to consolidate the year columns and mutate() to ensure the years are stored as numeric values.

Second Dataset: I am using Kiera Griffths’ post on generational spending on takeout. The wide problem here are the momths being used as column headers, which prevents us from easily calculating averages across the entire time period. I will use pivot_longer() to create Month column and a Spending column. I will use separate to split the month strings if needed for better sorting.

Third Dataset: I am using Pascal Hermann’s post about player performance metrics. This dataset is actually a complex wide format where column names contain two variables: the stat(Goals/Assists) and the venue (Home/Away). I will use pivot_longer() with the .value sentinel to split the column names at the underscore, creating a tidy version where Venue is its own column.

Challenges

For the first data I anticipate R reading the data like character strings. I have to be sure they convert to numeric format so that the trend displays as chronologically and treats the time between years correctly.

For the second one I see the months and years combining and alphabetically sorting it. I am using the separate() to create distinction and possibly using factor to ensure the months are in order how they are meant to be chronologically.

Lastly, the third dataset presents a multi-variable-per column problem. I am going to have to use a regex or specifics to split the values. I will also need to ensure that the data remains balanced so that ‘Salah’s’ home goals and home assists stay linked to the Home venue.

##Code Deliverable Renewable Energy

Code

library(tidyverse)

# 3.2 Import and Tidy
url_renewable <- "https://raw.githubusercontent.com/CiaraBonn12/Project-2-Data-607/refs/heads/main/renewable_energy.csv"
raw_renewable <- read_csv(url_renewable)

tidy_renewable <- raw_renewable %>%
  # Pivot all year columns (1900 to 2020)
  pivot_longer(
    cols = `1990`:`2020`,
    names_to = "year",
    values_to = "renewable_pct"
  ) %>%
  # Fix the anticipated challenge: convert year to numeric mutate(year = as. numeric(year)) %>%
  # Normalize naming convention
  rename_with(tolower)

#3.3 Analysis: Line chart to show trends over time
ggplot(tidy_renewable, aes(x = year, y = renewable_pct, color = country)) +
  geom_line(size = 1) +
  geom_point() +
  labs(title = "Renewable Energy Consumption Trends", x = "Year", y = "Percent of Total Consumption") + theme_minimal()

Takeout Spending

Code

# 3.2
url_takeout <- "https://raw.githubusercontent.com/CiaraBonn12/Project-2-Data-607/refs/heads/main/takeout_spending.csv"
raw_takeout <- read_csv(url_takeout)

tidy_takeout <-raw_takeout %>%
  pivot_longer(
    cols = -generation,
    names_to = "month_year",
    values_to = "spending"
  ) %>%
  # Separate 'jan_2025' into 'month' and 'year'
  separate(month_year, into = c("month", "year"), sep = "_")

# 3.3 Analysis
summary_takeout <- tidy_takeout %>%
  group_by(generation) %>%
  summarize(avg_monthly_spend = mean(spending))

#Create a clean table output
knitr::kable(summary_takeout, caption = "Average Monthly Takeout Spending")

Average Monthly Takeout Spending
generation	avg_monthly_spend
Baby Boomers	120.0000
Gen X	195.8333
Gen Z	161.6667
Millennials	220.0000
Silent Generation	82.5000

Soccer Stats

Code

url_soccer <- "https://raw.githubusercontent.com/CiaraBonn12/Project-2-Data-607/refs/heads/main/soccer_stats.csv"
raw_soccer <- read_csv(url_soccer)

tidy_soccer <-raw_soccer %>%
  pivot_longer(
    cols = -player,
    # .value tells R that 'goals' and 'assists' are the new column names
    # 'venue' is the new variable (Home/Away)
    names_to = c(".value", "venue"),
    names_sep = "_"
  )

# 3.3 Analysis
ggplot(tidy_soccer, aes(x = player, y = goals, fill = venue)) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(title = "Home vs Away Goals Scored", 
       x = "Player", y = "Total Goals") +
  theme_light()

Conclusions

Renewable Energy Trends The analysis reveals a stark contrast in energy transition between 1990 and 2020. While Brazil maintains a dominant lead with nearly 50% of its energy coming from renewable sources, India shows a significant downward trend, dropping from over 50% in 1990 to roughly 32% in 2020.This may tell us that while Brazil successfully scaled its green infrastructure, India’s rapid industrialization during this period likely relied more heavily on non-renewable sources to meet growing demand.

Generational Takeout Spending The tidied data confirms that Millennials are the highest spenders on takeout, averaging over $220 per month, followed closely by Gen X. In contrast, the Silent Generation averages the lowest at about $82. This data supports the hypothesis that younger generations prioritize the convenience of prepared meals significantly more than older cohorts, who may have more established cooking habits or lower discretionary spending on food services.

Soccer Performances By splitting the metrics by venue, we can see a clear “Home Field Advantage” for elite scores like Haaland, who scored 15 goals at home versus 10 away. Interestingly, Son is the outlier in this dataset, performing better in away matches than at home. This level of insight was only possible after the transformation, proving that combining two variables into a single column header hides critical performance patterns.

AI Transcript

Appendix: AI Interaction Log (Citations) Model: Gemini 3 Flash

Date of Interaction: March 7–8, 2026

Prompt 1: Data Generation User Prompt: “How do I create a csv for the rest of the datasets? I have Pascal’s soccer stats and Kiera’s takeout spending.”

AI Response: Provided tribble() code blocks to structure the wide data and the write_csv() command to save them locally. It explained the need for a consistent number of elements in each row to avoid vec_c() errors.

Prompt 2: Advanced Tidying Logic User Prompt: “How do I tidy the soccer data where the columns are goals_home, goals_away, etc.?”

AI Response: Suggested using pivot_longer() with names_to = c(“.value”, “venue”) and names_sep = “_“.

Key Learning: I learned that the .value sentinel acts as a placeholder that tells R to take part of the column name and turn it into the actual column header, while the other part becomes a row value.

Prompt 3: Troubleshooting YAML and Rendering User Prompt: “Is this YAML better? [User provided YAML with indentation errors]”

AI Response: Corrected the indentation under the html: and execute: blocks. It explained that Quarto is space-sensitive and provided a “Gold Standard” YAML header to ensure the Table of Contents (TOC) and code-folding worked on RPubs.