Project 2

Author

Sinem K Moschos

Approach

Introduction

For Project 2, the goal is to practice cleaning and reshaping datasets that are in a wide format. I selected three datasets from DATA607 Course Week 5 Discussion.

The three datasets I chose are:

  1. World GDP by Country https://www.kaggle.com/datasets/annafabris/world-gdp-by-country-1960-2022

Raw Link: https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week6-Project2/GDP.csv

  1. Renewable Energy Consumption https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS

Raw Link: https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week6-Project2/API_EG.FEC.RNEW.ZS_DS2_en_csv_v2_4948.csv

  1. Fossil CO₂ Emissions by Country https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions#Fossil_carbon_dioxide_emissions_by_country

Source Link: https://github.com/sinemkilicdere/Data607/blob/main/Week6-Project2/EDGAR_2024_GHG_booklet_2024_fossilCO2only.csv

All three datasets contain yearly values spread across many columns. This makes them examples of wide format data. The main purpose of this project is to transform these wide datasets into tidy format so they can be analyzed more easily.

Why These Datasets Are Wide

In all three datasets, the structure looks like this:

  • One row per country
  • Many columns representing different years

This format is easy to read, but it is not ideal for analysis because:

  • The years are stored as column names instead of values
  • It is harder to filter
  • Many columns contain missing values

To properly analyze time based data, it is better to have:

  • One column for Country
  • One column for Year
  • One column for the Value

Step 1: Create and Save the CSV File

First, I will download each dataset and save it as a .csv file. I will keep the original wide structure on purpose, because the goal of this assignment is to practice tidying and transformations. Each dataset will be uploaded to my GitHub repository.

Step 2: Load the Dataset into R

I will load each CSV file into R using read_csv().

After loading the dataset, I will:

  • View the first few rows
  • Use glimpse() to check the structure
  • Make sure numeric columns are correctly formatted
  • Check for missing values

Step 3: Tidy and Transform the Data

The datasets have years as columns, I will use:

  • pivot_longer() to convert year columns into a single Year column
  • Clean numeric values if there are commas or formatting issues
  • Convert the Year column to numeric
  • Handle missing values (NA)
  • Possibly filter out summary rows like “World” if needed

After this step, each dataset should have a structure like:

Country | Year | Value

Step 4: Perform the Required Analysis

After tidying each dataset, I will perform analysis such as:

  • Calculating trends over time
  • Identifying highest and lowest countries
  • Comparing values across countries
  • Grouping and summarizing data

Step 5: Conclusions

For each dataset, I will write a short conclusion explaining:

  • What transformations I performed
  • What patterns or trends I observed
  • Any limitations in the data

Dataset 1 — World GDP by Country

Data source

I used the World GDP by country dataset from Kaggle. The original file is wide format (years are columns).

Source: https://www.kaggle.com/datasets/annafabris/world-gdp-by-country-1960-2022

Step 1.1 Load packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(607)

Step 1.2 — Import the raw wide data

gdp_raw <- read_csv(
  "https://raw.githubusercontent.com/sinemkilicdere/Data607/main/Week6-Project2/GDP.csv",
  show_col_types = FALSE
)

What does the raw data look like?

dim(gdp_raw)
[1] 266  65
names(gdp_raw)[1:12]
 [1] "Country"      "Country Code" "1960"         "1961"         "1962"        
 [6] "1963"         "1964"         "1965"         "1966"         "1967"        
[11] "1968"         "1969"        
gdp_raw %>% 
  select(1:8) %>% 
  slice(1:6)
# A tibble: 6 × 8
  Country   `Country Code`   `1960`   `1961`   `1962`   `1963`   `1964`   `1965`
  <chr>     <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Aruba     ABW            NA       NA       NA       NA       NA       NA      
2 Africa E… AFE             2.11e10  2.16e10  2.35e10  2.80e10  2.59e10  2.95e10
3 Afghanis… AFG             5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8  1.01e 9
4 Africa W… AFW             1.04e10  1.12e10  1.20e10  1.27e10  1.39e10  1.49e10
5 Angola    AGO            NA       NA       NA       NA       NA       NA      
6 Albania   ALB            NA       NA       NA       NA       NA       NA      
glimpse(gdp_raw)
Rows: 266
Columns: 65
$ Country        <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan", …
$ `Country Code` <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB",…
$ `1960`         <dbl> NA, 21125015452, 537777811, 10447637853, NA, NA, NA, NA…
$ `1961`         <dbl> NA, 21616228139, 548888896, 11173212080, NA, NA, NA, NA…
$ `1962`         <dbl> NA, 23506279900, 546666678, 11990534018, NA, NA, NA, NA…
$ `1963`         <dbl> NA, 28048360188, 751111191, 12727688165, NA, NA, NA, NA…
$ `1964`         <dbl> NA, 25920665260, 800000044, 13898109284, NA, NA, NA, NA…
$ `1965`         <dbl> NA, 29472103270, 1006666638, 14929792388, NA, NA, NA, N…
$ `1966`         <dbl> NA, 32014368121, 1399999967, 15910837742, NA, NA, NA, N…
$ `1967`         <dbl> NA, 33269509510, 1673333418, 14510579889, NA, NA, NA, N…
$ `1968`         <dbl> NA, 36327785495, 1373333367, 14968235782, NA, NA, NA, 3…
$ `1969`         <dbl> NA, 41638967621, 1408888922, 16979315745, NA, NA, NA, 3…
$ `1970`         <dbl> NA, 44629891649, 1748886596, 23596163865, NA, NA, 78617…
$ `1971`         <dbl> NA, 49173371529, 1831108971, 20936358634, NA, NA, 89406…
$ `1972`         <dbl> NA, 53123459912, 1595555476, 25386169423, NA, NA, 11341…
$ `1973`         <dbl> NA, 69482723444, 1733333264, 31975594565, NA, NA, 15084…
$ `1974`         <dbl> NA, 85380645042, 2155555498, 44416677335, NA, NA, 18655…
$ `1975`         <dbl> NA, 90835426418, 2366666616, 51667190242, NA, NA, 22011…
$ `1976`         <dbl> NA, 90212747243, 2555555567, 62351622300, NA, NA, 22728…
$ `1977`         <dbl> NA, 102240575583, 2953333418, 65595122956, NA, NA, 2539…
$ `1978`         <dbl> NA, 116084638702, 3300000109, 71496496574, NA, NA, 3080…
$ `1979`         <dbl> NA, 134256827127, 3697940410, 88948338390, NA, NA, 4115…
$ `1980`         <dbl> NA, 171217790781, 3641723322, 112439126385, 5930503401,…
$ `1981`         <dbl> NA, 175859256874, 3478787909, 211338060015, 5550483036,…
$ `1982`         <dbl> NA, 168095657215, NA, 187448724920, 5550483036, NA, 375…
$ `1983`         <dbl> NA, 175564912386, NA, 138384182007, 5784341596, NA, 327…
$ `1984`         <dbl> NA, 160646748724, NA, 114516348921, 6131475065, 1857338…
$ `1985`         <dbl> NA, 136759437910, NA, 116776995133, 7554065410, 1897050…
$ `1986`         <dbl> 405586592, 153050335916, NA, 107886511309, 7072536109, …
$ `1987`         <dbl> 487709497, 186658478814, NA, 110728825942, 8084412414, …
$ `1988`         <dbl> 596648045, 204765985926, NA, 109438851254, 8769836769, …
$ `1989`         <dbl> 695530726, 218241607366, NA, 102254998563, 10201780977,…
$ `1990`         <dbl> 764804469, 254062093242, NA, 122387353859, 11229515599,…
$ `1991`         <dbl> 872067039, 276856728336, NA, 118039698016, 12704558517,…
$ `1992`         <dbl> 958659218, 246088124936, NA, 118893094122, 15114352005,…
$ `1993`         <dbl> 1083240223, 242926405780, NA, 99272180411, 11051939102,…
$ `1994`         <dbl> 1245810056, 239610677917, NA, 86636400266, 3390500000, …
$ `1995`         <dbl> 1320670391, 270327154575, NA, 108690885030, 5561222222,…
$ `1996`         <dbl> 1379888268, 269490833465, NA, 126287285163, 7526963964,…
$ `1997`         <dbl> 1531843575, 283446224788, NA, 127602388366, 7648377413,…
$ `1998`         <dbl> 1665363128, 266652333831, NA, 130678128885, 6506229607,…
$ `1999`         <dbl> 1722905028, 263024788890, NA, 138085971820, 6152922943,…
$ `2000`         <dbl> 1873184358, 284759318603, NA, 140945759314, 9129594819,…
$ `2001`         <dbl> 1896648045, 259643121973, NA, 148529518712, 8936079253,…
$ `2002`         <dbl> 1962011173, 266529432166, 3854235264, 177201164643, 152…
$ `2003`         <dbl> 2044134078, 354176768091, 4539496563, 205214466071, 178…
$ `2004`         <dbl> 2.254749e+09, 4.404818e+11, 5.220825e+09, 2.542648e+11,…
$ `2005`         <dbl> 2.359777e+09, 5.139416e+11, 6.226199e+09, 3.108896e+11,…
$ `2006`         <dbl> 2.469832e+09, 5.775869e+11, 6.971383e+09, 3.969210e+11,…
$ `2007`         <dbl> 2.677654e+09, 6.628680e+11, 9.715765e+09, 4.654855e+11,…
$ `2008`         <dbl> 2.843017e+09, 7.105362e+11, 1.024977e+10, 5.677912e+11,…
$ `2009`         <dbl> 2.553631e+09, 7.219012e+11, 1.215484e+10, 5.083627e+11,…
$ `2010`         <dbl> 2.453631e+09, 8.635195e+11, 1.563384e+10, 5.985216e+11,…
$ `2011`         <dbl> 2.637989e+09, 9.678246e+11, 1.819041e+10, 6.820159e+11,…
$ `2012`         <dbl> 2.615084e+09, 9.753548e+11, 2.020357e+10, 7.375895e+11,…
$ `2013`         <dbl> 2.727933e+09, 9.859871e+11, 2.056449e+10, 8.339481e+11,…
$ `2014`         <dbl> 2.791061e+09, 1.006526e+12, 2.055058e+10, 8.943225e+11,…
$ `2015`         <dbl> 2.963128e+09, 9.273485e+11, 1.999814e+10, 7.686447e+11,…
$ `2016`         <dbl> 2.983799e+09, 8.851764e+11, 1.801955e+10, 6.913634e+11,…
$ `2017`         <dbl> 3.092179e+09, 1.021043e+12, 1.889635e+10, 6.848988e+11,…
$ `2018`         <dbl> 3.276188e+09, 1.007196e+12, 1.841886e+10, 7.670257e+11,…
$ `2019`         <dbl> 3.395794e+09, 1.000834e+12, 1.890450e+10, 8.225384e+11,…
$ `2020`         <dbl> 2.610039e+09, 9.275933e+11, 2.014345e+10, 7.864600e+11,…
$ `2021`         <dbl> 3.126019e+09, 1.081998e+12, 1.458314e+10, 8.444597e+11,…
$ `2022`         <dbl> NA, 1.169484e+12, NA, 8.778633e+11, 1.067136e+11, 1.888…

What I notice: • Each row is a country (sometimes a region). • Each year (like 1960, 1961, 1962…) is its own column. • There are lots of missing values (NA).

Step 1.3 — Tidying and transformation

  1. Wide → long using pivot_longer()

Right now, years are spread across many columns. I want a tidy dataset where: • one column is the year • one column is the GDP value

gdp_tidy <- gdp_raw %>%
  pivot_longer(
    cols = matches("^\\d{4}$"),
    names_to = "year",
    values_to = "gdp_usd"
  )
  1. Rename variables to be consistent

Some column names have spaces or capital letters. I renamed them to simple lowercase names.

gdp_tidy <- gdp_tidy %>%
  rename(
    country = Country,
    country_code = `Country Code`
  ) %>%
  rename_with(tolower)
  1. Fix data types

After pivoting, the year column becomes text. I convert it to an integer.

gdp_tidy <- gdp_tidy %>%
  mutate(year = as.integer(year))
  1. Missing values decision

GDP is a number, so if it is missing (NA), I cannot use it in calculations. My decision: remove rows where GDP is missing.

gdp_tidy <- gdp_tidy %>%
  drop_na(gdp_usd)

Check the tidy data

glimpse(gdp_tidy)
Rows: 13,365
Columns: 4
$ country      <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "…
$ year         <int> 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 199…
$ gdp_usd      <dbl> 405586592, 487709497, 596648045, 695530726, 764804469, 87…
gdp_tidy %>% slice(1:6)
# A tibble: 6 × 4
  country country_code  year    gdp_usd
  <chr>   <chr>        <int>      <dbl>
1 Aruba   ABW           1986 405586592.
2 Aruba   ABW           1987 487709497.
3 Aruba   ABW           1988 596648045.
4 Aruba   ABW           1989 695530726.
5 Aruba   ABW           1990 764804469.
6 Aruba   ABW           1991 872067039.

Now it is tidy because: • Each row is one country in one year. • The main columns are: country, country_code, year, and gdp_usd.

Step 1.4 — Analysis

A) Find the most recent year in the dataset

latest_year <- max(gdp_tidy$year)
latest_year
[1] 2022

B) Top 10 countries by GDP in that year

top10_latest <- gdp_tidy %>%
  filter(year == latest_year) %>%
  filter(
    country != "World",
    !str_detect(country, "income"),
    !str_detect(country, "OECD"),
    !str_detect(country, "dividend"),
    !str_detect(country, "total"),
    !str_detect(country, "IBRD"),
    !str_detect(country, "IDA"),
    !str_detect(country, "&")
  ) %>%
  arrange(desc(gdp_usd)) %>%
  slice(1:10)

top10_latest
# A tibble: 10 × 4
   country        country_code  year gdp_usd
   <chr>          <chr>        <int>   <dbl>
 1 North America  NAC           2022 2.76e13
 2 United States  USA           2022 2.55e13
 3 China          CHN           2022 1.80e13
 4 European Union EUU           2022 1.66e13
 5 Euro area      EMU           2022 1.40e13
 6 South Asia     SAS           2022 4.36e12
 7 Japan          JPN           2022 4.23e12
 8 Germany        DEU           2022 4.07e12
 9 Arab World     ARB           2022 3.56e12
10 India          IND           2022 3.39e12

C) Bar chart for the top 10

top10_latest <- gdp_tidy %>%
  filter(year == latest_year) %>%
  filter(nchar(country_code) == 3) %>%
  arrange(desc(gdp_usd)) %>%
  slice(1:10)

D) GDP trend over time for the top 5 (from the latest year)

First, I pick the top 5 countries from the latest year.

top5_countries <- top10_latest %>%
  slice(1:5) %>%
  pull(country)

top5_countries
[1] "World"                     "High income"              
[3] "OECD members"              "Post-demographic dividend"
[5] "IDA & IBRD total"         
gdp_tidy %>%
  filter(country %in% top5_countries) %>%
  ggplot(aes(x = year, y = gdp_usd, color = country)) +
  geom_line() +
  labs(
    title = "GDP Over Time for Top 5 Countries",
    x = "Year",
    y = "GDP (Current US$)",
    color = "Country"
  )

Conclusion (Dataset 1)

This GDP dataset started in wide format because every year was a separate column. I converted it into tidy (long) format using pivot_longer(), renamed columns to be consistent, converted the year column into numbers, and removed rows with missing GDP values. After tidying, I analyzed the most recent year in the dataset and found the top 10 countries by GDP. I also plotted the GDP trend over time for the top 5 countries to see how their GDP changes across years.

Dataset 2 — Renewable Energy Consumption

Data Source

This dataset contains Renewable Energy Consumption by country and year.

Source: https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS

Raw file used in this project: API_EG.FEC.RNEW.ZS_DS2_en_csv_v2_4948.csv

The dataset is originally in wide format, where each year is stored as a separate column.

Step 2.1 — Import Raw Wide Data

The World Bank file includes extra descriptive rows at the top, so I skip the first 4 rows.

renew_raw <- read_csv(
  "https://raw.githubusercontent.com/sinemkilicdere/Data607/main/Week6-Project2/API_EG.FEC.RNEW.ZS_DS2_en_csv_v2_4948.csv",
  skip = 4,
  show_col_types = FALSE
)
New names:
• `` -> `...71`

Examine structure

dim(renew_raw)
[1] 266  71
names(renew_raw)[1:10]
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"          
renew_raw %>%
  select(1:8) %>%
  slice(1:6)
# A tibble: 6 × 8
  `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
  <chr>           <chr>          <chr>            <chr>            <lgl>  <lgl> 
1 Aruba           ABW            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
2 Africa Eastern… AFE            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
3 Afghanistan     AFG            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
4 Africa Western… AFW            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
5 Angola          AGO            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
6 Albania         ALB            Renewable energ… EG.FEC.RNEW.ZS   NA     NA    
# ℹ 2 more variables: `1962` <lgl>, `1963` <lgl>
glimpse(renew_raw)
Rows: 266
Columns: 71
$ `Country Name`   <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan"…
$ `Country Code`   <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB…
$ `Indicator Name` <chr> "Renewable energy consumption (% of total final energ…
$ `Indicator Code` <chr> "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS",…
$ `1960`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1961`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1962`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1963`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1964`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1965`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1966`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1967`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1968`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1969`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1970`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1971`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1972`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1973`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1974`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1975`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1976`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1977`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1978`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1979`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1980`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1981`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1982`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1983`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1984`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1985`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1986`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1987`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1988`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1989`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `1990`           <dbl> 0.300000, 60.842404, 23.000000, 85.895442, 72.300000,…
$ `1991`           <dbl> 0.200000, 62.133295, 23.700000, 85.262074, 71.900000,…
$ `1992`           <dbl> 0.200000, 63.940789, 27.400000, 83.818324, 72.700000,…
$ `1993`           <dbl> 0.200000, 64.651965, 28.500000, 85.103232, 71.300000,…
$ `1994`           <dbl> 0.200000, 65.145067, 30.100000, 86.557685, 72.200000,…
$ `1995`           <dbl> 0.200000, 64.721535, 31.800000, 86.148094, 73.800000,…
$ `1996`           <dbl> 0.200000, 63.929272, 33.900000, 85.118648, 72.900000,…
$ `1997`           <dbl> 0.200000, 63.241293, 36.400000, 83.911059, 73.400000,…
$ `1998`           <dbl> 0.200000, 63.962096, 38.000000, 84.634537, 76.400000,…
$ `1999`           <dbl> 0.200000, 65.058401, 43.400000, 84.005039, 72.800000,…
$ `2000`           <dbl> 0.200000, 65.385867, 45.000000, 83.101512, 73.800000,…
$ `2001`           <dbl> 0.200000, 65.817363, 45.600000, 81.939243, 72.800000,…
$ `2002`           <dbl> 0.200000, 65.099281, 37.800000, 81.735528, 70.800000,…
$ `2003`           <dbl> 0.200000, 64.615124, 36.700000, 80.710450, 65.600000,…
$ `2004`           <dbl> 0.200000, 63.678035, 44.200000, 80.790537, 62.300000,…
$ `2005`           <dbl> 0.200000, 63.109098, 33.900000, 80.934034, 68.900000,…
$ `2006`           <dbl> 0.200000, 62.738124, 31.900000, 81.873310, 62.600000,…
$ `2007`           <dbl> 0.200000, 61.505169, 28.800000, 82.330015, 58.900000,…
$ `2008`           <dbl> 0.200000, 61.908027, 21.200000, 81.211170, 55.400000,…
$ `2009`           <dbl> 0.300000, 61.304601, 16.500000, 82.390213, 53.000000,…
$ `2010`           <dbl> 5.500000, 62.876387, 15.200000, 80.687806, 53.100000,…
$ `2011`           <dbl> 5.700000, 62.805279, 12.600000, 79.282287, 51.700000,…
$ `2012`           <dbl> 6.900000, 61.409964, 15.400000, 78.481230, 49.000000,…
$ `2013`           <dbl> 6.900000, 60.801799, 16.900000, 76.701499, 46.800000,…
$ `2014`           <dbl> 6.900000, 61.235882, 19.100000, 75.741009, 47.500000,…
$ `2015`           <dbl> 6.700000, 61.670278, 17.700000, 76.181671, 47.100000,…
$ `2016`           <dbl> 7.000000, 61.822884, 20.200000, 75.796840, 48.100000,…
$ `2017`           <dbl> 6.800000, 61.426950, 19.500000, 76.101681, 52.500000,…
$ `2018`           <dbl> 8.600000, 61.587530, 18.300000, 75.476578, 52.500000,…
$ `2019`           <dbl> 8.300000, 62.690710, 18.900000, 75.078788, 51.000000,…
$ `2020`           <dbl> 9.100000, 65.782380, 18.200000, 75.846980, 60.100000,…
$ `2021`           <dbl> 8.8, NA, 20.0, NA, 52.9, 41.9, 18.4, NA, 1.0, 9.2, 9.…
$ `2022`           <dbl> 8.8, NA, 20.0, NA, NA, NA, 18.7, NA, NA, NA, NA, 0.4,…
$ `2023`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `2024`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `2025`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ...71            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

What I notice: • Each row represents a country. • Each year is stored as a column.

Step 2.2 — Tidying and Transformation

  1. Wide → Long using pivot_longer()
renew_tidy <- renew_raw %>%
  pivot_longer(
    cols = matches("^\\d{4}$"),
    names_to = "year",
    values_to = "renew_percent"
  )
  1. Rename variables consistently
renew_tidy <- renew_tidy %>%
  rename(
    country = `Country Name`,
    country_code = `Country Code`
  ) %>%
  rename_with(tolower)
  1. Fix data types
renew_tidy <- renew_tidy %>%
  mutate(year = as.integer(year))
  1. Handle missing values Remove rows where renewable percentage is missing.
renew_tidy <- renew_tidy %>%
  drop_na(renew_percent)
  1. Check tidy structure
glimpse(renew_tidy)
Rows: 8,234
Columns: 7
$ country          <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba",…
$ country_code     <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW…
$ `indicator name` <chr> "Renewable energy consumption (% of total final energ…
$ `indicator code` <chr> "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS", "EG.FEC.RNEW.ZS",…
$ ...71            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ year             <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,…
$ renew_percent    <dbl> 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2…
renew_tidy %>% slice(1:6)
# A tibble: 6 × 7
  country country_code `indicator name`             `indicator code` ...71  year
  <chr>   <chr>        <chr>                        <chr>            <lgl> <int>
1 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1990
2 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1991
3 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1992
4 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1993
5 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1994
6 Aruba   ABW          Renewable energy consumptio… EG.FEC.RNEW.ZS   NA     1995
# ℹ 1 more variable: renew_percent <dbl>

Step 2.3 — Analysis

A) Most recent year available

latest_year_renew <- max(renew_tidy$year)
latest_year_renew
[1] 2022

B) Top 10 countries with highest renewable energy percentage

top10_renew <- renew_tidy %>%
  filter(year == latest_year_renew) %>%
  filter(
    country != "World",
    !str_detect(country, "income"),
    !str_detect(country, "OECD"),
    !str_detect(country, "dividend"),
    !str_detect(country, "total"),
    !str_detect(country, "IBRD"),
    !str_detect(country, "IDA"),
    !str_detect(country, "&")
  ) %>%
  arrange(desc(renew_percent)) %>%
  slice(1:10)

top10_renew
# A tibble: 10 × 7
   country            country_code `indicator name` `indicator code` ...71  year
   <chr>              <chr>        <chr>            <chr>            <lgl> <int>
 1 Somalia, Fed. Rep. SOM          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 2 Liberia            LBR          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 3 Central African R… CAF          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 4 Uganda             UGA          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 5 Guinea-Bissau      GNB          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 6 Madagascar         MDG          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 7 Burundi            BDI          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 8 Bhutan             BTN          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
 9 Rwanda             RWA          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
10 Sierra Leone       SLE          Renewable energ… EG.FEC.RNEW.ZS   NA     2022
# ℹ 1 more variable: renew_percent <dbl>

C) Bar chart of top 10 renewable countries

top10_renew %>%
  ggplot(aes(x = reorder(country, renew_percent), y = renew_percent)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste("Top 10 Countries by Renewable Energy % in", latest_year_renew),
    x = "Country",
    y = "Renewable Energy (% of total)"
  )

D) Trend over time for top 5 renewable countries

top5_renew <- top10_renew %>%
  slice(1:5) %>%
  pull(country)

renew_tidy %>%
  filter(country %in% top5_renew) %>%
  ggplot(aes(x = year, y = renew_percent, color = country)) +
  geom_line() +
  labs(
    title = "Renewable Energy % Over Time (Top 5 Countries)",
    x = "Year",
    y = "Renewable Energy (% of total)",
    color = "Country"
  )

Conclusion (Dataset 2)

This dataset was in wide format, with years stored as column names. I transformed it into tidy format using pivot_longer(), renamed variables, converted the year column to numeric, and removed missing values. After tidying, I identified the countries with the highest renewable energy usage in the most recent year and visualized trends over time for leading countries.

Dataset 3 — Fossil CO₂ Emissions

Data Source

This dataset contains fossil CO₂ emissions totals by country across many years in wide format.

Original reference page: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions#Fossil_carbon_dioxide_emissions_by_country

Raw file used in this project: EDGAR_2024_GHG_booklet_2024_fossilCO2only.csv

Step 3.1 — Import Raw Wide Data

co2_raw <- read_csv(
  "https://raw.githubusercontent.com/sinemkilicdere/Data607/main/Week6-Project2/EDGAR_2024_GHG_booklet_2024_fossilCO2only.csv",
  show_col_types = FALSE
)

Check the raw structure

dim(co2_raw)
[1] 214  57
names(co2_raw)[1:12]
 [1] "Substance"          "EDGAR Country Code" "Country"           
 [4] "1970"               "1971"               "1972"              
 [7] "1973"               "1974"               "1975"              
[10] "1976"               "1977"               "1978"              
co2_raw %>%
  select(1:8) %>%
  slice(1:6)
# A tibble: 6 × 8
  Substance `EDGAR Country Code` Country  `1970`  `1971`  `1972`  `1973`  `1974`
  <chr>     <chr>                <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 CO2       ABW                  Aruba   2.52e-2 2.88e-2 3.95e-2 4.43e-2 4.35e-2
2 CO2       AFG                  Afghan… 1.73e+0 1.73e+0 1.69e+0 1.73e+0 2.19e+0
3 CO2       AGO                  Angola  8.93e+0 8.52e+0 1.04e+1 1.13e+1 1.18e+1
4 CO2       AIA                  Anguil… 2.18e-3 2.18e-3 2.27e-3 2.12e-3 2.36e-3
5 CO2       AIR                  Intern… 1.70e+2 1.70e+2 1.80e+2 1.87e+2 1.80e+2
6 CO2       ALB                  Albania 4.84e+0 4.84e+0 5.52e+0 4.95e+0 5.33e+0
glimpse(co2_raw)
Rows: 214
Columns: 57
$ Substance            <chr> "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", …
$ `EDGAR Country Code` <chr> "ABW", "AFG", "AGO", "AIA", "AIR", "ALB", "ANT", …
$ Country              <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "In…
$ `1970`               <dbl> 2.521379e-02, 1.733920e+00, 8.933899e+00, 2.17758…
$ `1971`               <dbl> 2.882775e-02, 1.733710e+00, 8.519513e+00, 2.17768…
$ `1972`               <dbl> 0.03947211, 1.69358448, 10.36610427, 0.00227319, …
$ `1973`               <dbl> 0.04428944, 1.73390479, 11.34699566, 0.00211848, …
$ `1974`               <dbl> 4.346915e-02, 2.190318e+00, 1.180656e+01, 2.35983…
$ `1975`               <dbl> 5.739627e-02, 2.028967e+00, 1.090465e+01, 2.59365…
$ `1976`               <dbl> 5.642291e-02, 1.892642e+00, 7.291981e+00, 2.44414…
$ `1977`               <dbl> 6.709976e-02, 2.282574e+00, 1.203347e+01, 2.54677…
$ `1978`               <dbl> 7.193698e-02, 1.934107e+00, 1.421627e+01, 2.91135…
$ `1979`               <dbl> 7.591904e-02, 2.059302e+00, 1.417940e+01, 3.22275…
$ `1980`               <dbl> 7.977223e-02, 2.006423e+00, 1.431613e+01, 4.42236…
$ `1981`               <dbl> 8.289314e-02, 2.259933e+00, 1.323227e+01, 6.64942…
$ `1982`               <dbl> 8.182398e-02, 2.309601e+00, 1.277151e+01, 7.02841…
$ `1983`               <dbl> 8.442884e-02, 3.001699e+00, 1.322129e+01, 6.09838…
$ `1984`               <dbl> 9.859186e-02, 3.224699e+00, 1.335655e+01, 5.81064…
$ `1985`               <dbl> 1.041388e-01, 4.123375e+00, 1.410374e+01, 3.68049…
$ `1986`               <dbl> 1.137417e-01, 3.941639e+00, 1.424619e+01, 3.16462…
$ `1987`               <dbl> 1.286949e-01, 3.272775e+00, 1.450164e+01, 6.13632…
$ `1988`               <dbl> 1.428473e-01, 3.087913e+00, 1.494080e+01, 5.80820…
$ `1989`               <dbl> 0.19741783, 2.89595900, 15.10349834, 0.00616006, …
$ `1990`               <dbl> 2.008526e-01, 2.922237e+00, 1.123071e+01, 6.12652…
$ `1991`               <dbl> 0.22624058, 2.84700621, 11.78046209, 0.00759202, …
$ `1992`               <dbl> 2.505677e-01, 1.819948e+00, 1.211898e+01, 7.91597…
$ `1993`               <dbl> 2.513724e-01, 1.784793e+00, 1.214163e+01, 9.96564…
$ `1994`               <dbl> 0.28560127, 1.68515192, 11.49035212, 0.01314511, …
$ `1995`               <dbl> 0.31418487, 1.46299600, 12.95329566, 0.01640989, …
$ `1996`               <dbl> 0.20655787, 1.41718594, 15.67875885, 0.01622795, …
$ `1997`               <dbl> 0.34274167, 1.38464824, 16.40211133, 0.01357871, …
$ `1998`               <dbl> 0.35091413, 1.42041676, 17.03644627, 0.01246408, …
$ `1999`               <dbl> 0.3555122, 1.3755203, 18.0314495, 0.0117665, 342.…
$ `2000`               <dbl> 0.26857504, 1.01605943, 16.51190820, 0.01554756, …
$ `2001`               <dbl> 0.27146397, 0.94468302, 16.19230309, 0.01592658, …
$ `2002`               <dbl> 0.29656290, 0.94226559, 15.91544045, 0.01087533, …
$ `2003`               <dbl> 0.3419259, 1.0079642, 17.3296323, 0.0141239, 367.…
$ `2004`               <dbl> 0.34888891, 0.89288779, 17.69026090, 0.01348946, …
$ `2005`               <dbl> 0.38483943, 1.26634032, 15.63885114, 0.01367976, …
$ `2006`               <dbl> 0.4236865, 1.4082881, 16.3202467, 0.0128311, 444.…
$ `2007`               <dbl> 0.45753275, 2.00956423, 17.05581810, 0.01237373, …
$ `2008`               <dbl> 0.45399354, 3.83338277, 19.11943047, 0.01528787, …
$ `2009`               <dbl> 0.49274181, 6.08278722, 21.08885607, 0.02018065, …
$ `2010`               <dbl> 0.48220690, 7.78889311, 22.73989087, 0.02402448, …
$ `2011`               <dbl> 0.29018068, 11.40871139, 23.65003801, 0.02602831,…
$ `2012`               <dbl> 0.37485742, 9.76775564, 24.71617744, 0.02604374, …
$ `2013`               <dbl> 0.41230731, 8.30002170, 28.53421971, 0.02796127, …
$ `2014`               <dbl> 0.44068922, 7.82574085, 30.88726366, 0.02791705, …
$ `2015`               <dbl> 0.46202559, 8.34652078, 33.09749892, 0.02802664, …
$ `2016`               <dbl> 0.48488861, 7.52759371, 31.28580272, 0.02836258, …
$ `2017`               <dbl> 0.46659161, 8.06613849, 27.94209886, 0.02908662, …
$ `2018`               <dbl> 0.4658812, 7.9320047, 26.2588865, 0.0282473, 615.…
$ `2019`               <dbl> 0.55791727, 7.24906857, 27.57321601, 0.02760442, …
$ `2020`               <dbl> 0.45255306, 7.05413297, 20.71091767, 0.02280365, …
$ `2021`               <dbl> 0.50063525, 7.93078103, 25.26283179, 0.02201783, …
$ `2022`               <dbl> 0.50269325, 8.25991526, 27.35303812, 0.02186134, …
$ `2023`               <dbl> 0.53002610, 8.70734983, 28.22992761, 0.02295555, …

Step 3.2 — Tidying and Transformation

  1. Wide → Long using pivot_longer()
co2_tidy <- co2_raw %>%
  pivot_longer(
    cols = matches("^\\d{4}$"),
    names_to = "year",
    values_to = "co2_emissions"
  )
  1. Rename columns consistently
co2_tidy <- co2_tidy %>%
  rename_with(tolower)
  1. Fix data types
co2_tidy <- co2_tidy %>%
  mutate(year = as.integer(year))
  1. Handle missing values I remove rows where emissions are missing.
co2_tidy <- co2_tidy %>%
  drop_na(co2_emissions)
  1. Check the tidy structure
glimpse(co2_tidy)
Rows: 11,448
Columns: 5
$ substance            <chr> "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", "CO2", …
$ `edgar country code` <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", …
$ country              <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aru…
$ year                 <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1…
$ co2_emissions        <dbl> 0.02521379, 0.02882775, 0.03947211, 0.04428944, 0…
co2_tidy %>% slice(1:6)
# A tibble: 6 × 5
  substance `edgar country code` country  year co2_emissions
  <chr>     <chr>                <chr>   <int>         <dbl>
1 CO2       ABW                  Aruba    1970        0.0252
2 CO2       ABW                  Aruba    1971        0.0288
3 CO2       ABW                  Aruba    1972        0.0395
4 CO2       ABW                  Aruba    1973        0.0443
5 CO2       ABW                  Aruba    1974        0.0435
6 CO2       ABW                  Aruba    1975        0.0574

Step 3.3 — Analysis

A) Most recent year available

latest_year_co2 <- max(co2_tidy$year)
latest_year_co2
[1] 2023

B) Top 10 countries by CO₂ emissions in the latest year

top10_co2 <- co2_tidy %>%
  filter(year == latest_year_co2) %>%
  filter(
    country != "World",
    !str_detect(country, "income"),
    !str_detect(country, "OECD"),
    !str_detect(country, "total"),
    !str_detect(country, "&")
  ) %>%
  arrange(desc(co2_emissions)) %>%
  slice(1:10)

top10_co2
# A tibble: 10 × 5
   substance `edgar country code` country                 year co2_emissions
   <chr>     <chr>                <chr>                  <int>         <dbl>
 1 CO2       GLOBAL TOTAL         GLOBAL TOTAL            2023        39024.
 2 CO2       CHN                  China                   2023        13260.
 3 CO2       USA                  United States           2023         4682.
 4 CO2       IND                  India                   2023         2955.
 5 CO2       EU27                 EU27                    2023         2512.
 6 CO2       RUS                  Russia                  2023         2070.
 7 CO2       JPN                  Japan                   2023          945.
 8 CO2       IRN                  Iran                    2023          779.
 9 CO2       SEA                  International Shipping  2023          706.
10 CO2       IDN                  Indonesia               2023          675.

C) Bar chart — Top 10 CO₂ emitters

top10_co2 %>%
  ggplot(aes(x = reorder(country, co2_emissions), y = co2_emissions)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste("Top 10 Countries by Fossil CO₂ Emissions in", latest_year_co2),
    x = "Country",
    y = "CO₂ Emissions"
  )

D) Trend over time for the top 5 CO₂ emitters

top5_co2 <- top10_co2 %>%
  slice(1:5) %>%
  pull(country)

co2_tidy %>%
  filter(country %in% top5_co2) %>%
  ggplot(aes(x = year, y = co2_emissions, color = country)) +
  geom_line() +
  labs(
    title = "CO₂ Emissions Over Time (Top 5 Countries)",
    x = "Year",
    y = "CO₂ Emissions",
    color = "Country"
  )

Conclusion (Dataset 3)

This CO₂ dataset started in wide format because each year was a separate column. I converted it into tidy format using pivot_longer(), converted the year column into numeric, removed missing values, and then analyzed the latest year to find the top emitting countries. I also plotted emission trends over time for the top 5 countries.