Sus Fin Final Project Data Cleaning

Data Cleaning Plan

I am starting with two datasets from an Excel file named “WEO2023_Free_Dataset.xlsx”, which includes sheets for “World Electricity” and “World CO2 Emissions”. I initially load the data from the “World Electricity” sheet into my WEO2023_Free_Dataset dataframe.

The data I want to focus on is the clean dataset I create, named WEO2023_Clean_Dataset, which I will subject to various transformations. Specifically, I am interested in two subsets of this data: the shares of electricity demand (WEO_Shares) and the values of electricity demand (WEO_Values).

These subsets are further processed to create two datasets in long format:

  • WEO_Shares_long, which will contain shares of electricity demand for different scenarios across various years.

  • WEO_Values_long, which will contain actual values of electricity demand across different years.

While the code does not show any join operation, if I decide to join the WEO_Shares_long and WEO_Values_long datasets, I would use common variables like “Country”, “Type”, “Description”, “Scenario”, and “Year” as my keys for joining because they are present in both long-format datasets.

I need to pivot data from wide to long format for both the shares and values datasets. This means I will convert the datasets from having a separate column for each year’s data to having one column for the year and one column for the corresponding shares or values (Year and Shares for WEO_Shares_long, and Year and Values for WEO_Values_long).

Regarding the data types for each variable:

  • I will treat the country/region identifiers as character or factor data types because they are categorical labels.

  • The “Type” will also be a character or factor, as it appears to represent categories.

  • The “Description” will be a character since it’s likely textual information.

  • The “Scenario” will be a character or factor, to represent the different scenarios like “STEPS”, “APS”, “NZE”.

  • For the “Year”, I will use a date or numeric data type, with a preference for date if I need to represent specific dates within those years.

  • The “Shares” will be numeric, as these are typically proportions or percentages.

  • The “Values” will be numeric since this data likely quantifies electricity demand.

In the process of creating these datasets, I am already converting some columns to the character data type, particularly for scenario names. It’s crucial for me to ensure that the numeric values retain their data type during the pivot operation, and the identifiers or categorical data are treated as characters or factors for accurate analysis and visualization.

Data Cleaning Code

World Electricity Data

Load Libraries & Import Dataset

library("esquisse")
Warning: package 'esquisse' was built under R version 4.2.3
library(tidyverse)
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
Warning: package 'scales' was built under R version 4.2.3

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library("openxlsx")
library(RColorBrewer)
library(fmsb)
Warning: package 'fmsb' was built under R version 4.2.3
library(readxl)

Load Dataset & Initial Tidying

WEO2023_Free_Dataset <- read_excel("~/Desktop/Sus Fin App & Meth/Sus Fin Final/00_data_raw/WEO2023_Free_Dataset.xlsx", sheet = "World Electricity")
New names:
• `` -> `...9`
• `` -> `...10`
• `Stated Policies Scenario` -> `Stated Policies Scenario...11`
• `Stated Policies Scenario` -> `Stated Policies Scenario...12`
• `Stated Policies Scenario` -> `Stated Policies Scenario...13`
• `Stated Policies Scenario` -> `Stated Policies Scenario...14`
• `Stated Policies Scenario` -> `Stated Policies Scenario...15`
• `Stated Policies Scenario` -> `Stated Policies Scenario...16`
• `Stated Policies Scenario` -> `Stated Policies Scenario...17`
• `` -> `...18`
• `NO` -> `NO...19`
• `NO` -> `NO...20`
• `NO` -> `NO...21`
• `` -> `...22`
• `NO` -> `NO...23`
• `NO` -> `NO...24`
• `` -> `...25`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...26`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...27`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...28`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...29`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...30`
• `` -> `...31`
• `NO` -> `NO...32`
• `NO` -> `NO...33`
• `NO` -> `NO...34`
• `` -> `...35`
• `NO` -> `NO...36`
• `NO` -> `NO...37`
• `` -> `...38`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...39`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...40`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...41`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...42`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...43`
• `` -> `...44`
• `NO` -> `NO...45`
• `NO` -> `NO...46`
• `NO` -> `NO...47`
• `` -> `...48`
• `NO` -> `NO...49`
• `NO` -> `NO...50`
WEO2023_Clean_Dataset <- WEO2023_Free_Dataset ### Dataset to be tidied
### Make Scenario Columns ###
WEO2023_Clean_Dataset[, c(9, 25, 38)] <- lapply(WEO2023_Clean_Dataset[, c(9, 25, 38)], as.character)
WEO2023_Clean_Dataset[, 9] <- "STEPS"
WEO2023_Clean_Dataset[, 25] <-"APS"
WEO2023_Clean_Dataset[, 38] <-"NZE"
### Make Row 6 (Years) into column names
colnames(WEO2023_Clean_Dataset) <- as.character(unlist(WEO2023_Clean_Dataset[6, ]))
WEO2023_Clean_Dataset <- WEO2023_Clean_Dataset[-(1:6), ]
WEO2023_Clean_Dataset <- WEO2023_Clean_Dataset[ , -c(1, 2, 7, 8)]
WEO2023_Clean_Dataset <- WEO2023_Clean_Dataset[c(6, 1:5, 7:ncol(WEO2023_Clean_Dataset))]
WEO2023_Clean_Dataset <- rbind(colnames(WEO2023_Clean_Dataset), WEO2023_Clean_Dataset)
colnames(WEO2023_Clean_Dataset) <- 1:ncol(WEO2023_Clean_Dataset)
WEO2023_Clean_Dataset <- WEO2023_Clean_Dataset[ , -c(2, 5, 18:20, 31:33, 44:46)]

Clean, Subset & Pivot Long Dataset on Electricity Demand Shares

################################################################################### WEO Shares Dataset ###########################################
### Scenario Names for Shares ###
colnames(WEO2023_Clean_Dataset) <- 1:ncol(WEO2023_Clean_Dataset)
WEO2023_Clean_Dataset[, c(12, 22, 32)] <- lapply(WEO2023_Clean_Dataset[, c(12, 22, 32)], as.character)
WEO2023_Clean_Dataset[, 12] <- "STEPS"
WEO2023_Clean_Dataset[, 22] <-"APS"
WEO2023_Clean_Dataset[, 32] <-"NZE"

WEO_Shares <- WEO2023_Clean_Dataset[ , c(1:3,12:15, 22:25, 32:35)]
colnames(WEO_Shares) <- as.character(unlist(WEO_Shares[1, ]))
WEO_Shares <- WEO_Shares[-c(1, 21:23), ]
# SUBSET DATA & STACK #
# Subset 1
STEPS_Shares <- WEO_Shares[, c(1:7)]
# Subset 2
APS_Shares <- WEO_Shares[, c(1:3, 8:11)]
colnames(APS_Shares)[4:7] <- colnames(STEPS_Shares)[4:7]
# Subset 3
NZE_Shares<- WEO_Shares[, c(1:3, 12:15)]
colnames(NZE_Shares)[4:7] <- colnames(STEPS_Shares)[4:7]
# Stack Subsets 1-3
stacked_data <- rbind(STEPS_Shares, APS_Shares, NZE_Shares)
colnames(stacked_data)[2:4] <- c("Type", "Description", "Scenario")
### PIVOT DATA ####
WEO_Shares_long <- stacked_data[,-1] %>%
  pivot_longer(cols = 4:6, names_to = "Year", values_to = "Shares")

Clean, Subset & Pivot Long Dataset on Electricity Demand

################################################################################
################# WEO Values Dataset ###########################################
WEO_Values <- WEO2023_Clean_Dataset[ , -c(12:15, 22:25, 32:35)]
new_order <- c(1:3, 5:6, 4, 7:ncol(WEO_Values))
WEO_Values <- WEO_Values[, new_order]
colnames(WEO_Values) <- 1:ncol(WEO_Values)
colnames(WEO_Values) <- as.character(unlist(WEO_Values[1, ]))
WEO_Values <- WEO_Values[-c(1,21:23), ]

### STACKED DATA######
base_segment <- WEO_Values[, 1:11] # Columns 1 to 11
#Prepare the additional segments for stacking
# Including columns 1 to 5 with both segments 12:17 and 18:23
segment_12_17 <- WEO_Values[, c(1:5, 12:17)]
segment_18_23 <- WEO_Values[, c(1:5, 18:23)]
colnames(segment_12_17)[6:11] <- colnames(base_segment)[6:11]
colnames(segment_18_23)[6:11] <- colnames(base_segment)[6:11]
# Step 3: Stack the segments
stacked_values<- rbind(base_segment, segment_12_17, segment_18_23)
colnames(stacked_values)[2:3] <- c("Type", "Description")
new_order <- c(2:3, 6, 4:5, 7:ncol(stacked_values))
stacked_values <- stacked_values[, new_order]
### PIVOT DATA ####
WEO_Values_long <- stacked_values %>%
  pivot_longer(cols = 4:10, names_to = "Year", values_to = "Values")

World CO2 Data

WEO2023_Free_Dataset <- read_excel("~/Desktop/Sus Fin App & Meth/Sus Fin Final/00_data_raw/WEO2023_Free_Dataset.xlsx",sheet = "World CO2 Emissions")
New names:
• `` -> `...9`
• `` -> `...10`
• `Stated Policies Scenario` -> `Stated Policies Scenario...11`
• `Stated Policies Scenario` -> `Stated Policies Scenario...12`
• `Stated Policies Scenario` -> `Stated Policies Scenario...13`
• `Stated Policies Scenario` -> `Stated Policies Scenario...14`
• `Stated Policies Scenario` -> `Stated Policies Scenario...15`
• `Stated Policies Scenario` -> `Stated Policies Scenario...16`
• `Stated Policies Scenario` -> `Stated Policies Scenario...17`
• `` -> `...18`
• `NO` -> `NO...19`
• `NO` -> `NO...20`
• `` -> `...21`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...22`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...23`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...24`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...25`
• `Announced Pledges Scenario` -> `Announced Pledges Scenario...26`
• `` -> `...27`
• `NO` -> `NO...28`
• `NO` -> `NO...29`
• `` -> `...30`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...31`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...32`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...33`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...34`
• `Net Zero Emissions by 2050 Scenario` -> `Net Zero Emissions by 2050
  Scenario...35`
• `` -> `...36`
• `NO` -> `NO...37`
• `NO` -> `NO...38`