I am starting with two datasets from an Excel file named “WEO2023_Free_Dataset.xlsx”, which includes sheets for “World Electricity” and “World CO2 Emissions”. I initially load the data from the “World Electricity” sheet into my WEO2023_Free_Dataset dataframe.
The data I want to focus on is the clean dataset I create, named WEO2023_Clean_Dataset, which I will subject to various transformations. Specifically, I am interested in two subsets of this data: the shares of electricity demand (WEO_Shares) and the values of electricity demand (WEO_Values).
These subsets are further processed to create two datasets in long format:
WEO_Shares_long, which will contain shares of electricity demand for different scenarios across various years.
WEO_Values_long, which will contain actual values of electricity demand across different years.
While the code does not show any join operation, if I decide to join the WEO_Shares_long and WEO_Values_long datasets, I would use common variables like “Country”, “Type”, “Description”, “Scenario”, and “Year” as my keys for joining because they are present in both long-format datasets.
I need to pivot data from wide to long format for both the shares and values datasets. This means I will convert the datasets from having a separate column for each year’s data to having one column for the year and one column for the corresponding shares or values (Year and Shares for WEO_Shares_long, and Year and Values for WEO_Values_long).
Regarding the data types for each variable:
I will treat the country/region identifiers as character or factor data types because they are categorical labels.
The “Type” will also be a character or factor, as it appears to represent categories.
The “Description” will be a character since it’s likely textual information.
The “Scenario” will be a character or factor, to represent the different scenarios like “STEPS”, “APS”, “NZE”.
For the “Year”, I will use a date or numeric data type, with a preference for date if I need to represent specific dates within those years.
The “Shares” will be numeric, as these are typically proportions or percentages.
The “Values” will be numeric since this data likely quantifies electricity demand.
In the process of creating these datasets, I am already converting some columns to the character data type, particularly for scenario names. It’s crucial for me to ensure that the numeric values retain their data type during the pivot operation, and the identifiers or categorical data are treated as characters or factors for accurate analysis and visualization.
Data Cleaning Code
World Electricity Data
Load Libraries & Import Dataset
library("esquisse")
Warning: package 'esquisse' was built under R version 4.2.3
library(tidyverse)
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
Warning: package 'scales' was built under R version 4.2.3
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor