This notebook performs an exploratory data analysis (EDA) on the hourly energy demand, generation, pricing, and weather dataset.
This dataset contains four years of hourly data on electricity consumption, generation, prices, and weather conditions in Spain. Energy data was retrieved from ENTSO-E (a public TSO portal), pricing data from Red Eléctrica de España, and weather data (for the five largest cities in Spain) was originally obtained via the OpenWeather API and made public by the dataset author.
The main goals of this EDA are:
# Load necessary packages for data wrangling and visualization
library(tidyverse) # Core data manipulation tools
library(lubridate) # Date-time parsing and extraction
library(skimr) # Data summary and structure
library(janitor) # Clean column names
library(ggplot2) # Data visualization
library(ggthemes) # Visualization themes
library(viridis) # Color palettes
library(ggcorrplot) # Correlation matrix visualization
# Read raw CSV files (energy and weather) from 'data/raw' relative to notebook location
energy <- read_csv("../data/raw/energy_dataset.csv")
weather <- read_csv("../data/raw/weather_features.csv")
# Convert column names to snake_case format
energy <- energy %>% clean_names()
weather <- weather %>% clean_names()
To begin the analysis, we perform a quick structural inspection using
glimpse() and a detailed statistical summary with
skim() for both datasets: energy and
weather.
# Display the structure of the energy dataset
glimpse(energy)
## Rows: 35,064
## Columns: 29
## $ time <dttm> 2014-12-31 23:00:00, 2015…
## $ generation_biomass <dbl> 447, 449, 448, 438, 428, 4…
## $ generation_fossil_brown_coal_lignite <dbl> 329, 328, 323, 254, 187, 1…
## $ generation_fossil_coal_derived_gas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_gas <dbl> 4844, 5196, 4857, 4314, 41…
## $ generation_fossil_hard_coal <dbl> 4821, 4755, 4581, 4131, 38…
## $ generation_fossil_oil <dbl> 162, 158, 157, 160, 156, 1…
## $ generation_fossil_oil_shale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_peat <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_geothermal <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_hydro_pumped_storage_aggregated <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ generation_hydro_pumped_storage_consumption <dbl> 863, 920, 1164, 1503, 1826…
## $ generation_hydro_run_of_river_and_poundage <dbl> 1051, 1009, 973, 949, 953,…
## $ generation_hydro_water_reservoir <dbl> 1899, 1658, 1371, 779, 720…
## $ generation_marine <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_nuclear <dbl> 7096, 7096, 7099, 7098, 70…
## $ generation_other <dbl> 43, 43, 43, 43, 43, 43, 43…
## $ generation_other_renewable <dbl> 73, 71, 73, 75, 74, 74, 74…
## $ generation_solar <dbl> 49, 50, 50, 50, 42, 34, 34…
## $ generation_waste <dbl> 196, 195, 196, 191, 189, 1…
## $ generation_wind_offshore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_wind_onshore <dbl> 6378, 5890, 5461, 5238, 49…
## $ forecast_solar_day_ahead <dbl> 17, 16, 8, 2, 9, 4, 3, 12,…
## $ forecast_wind_offshore_eday_ahead <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ forecast_wind_onshore_day_ahead <dbl> 6436, 5856, 5454, 5151, 48…
## $ total_load_forecast <dbl> 26118, 24934, 23515, 22642…
## $ total_load_actual <dbl> 25385, 24382, 22734, 21286…
## $ price_day_ahead <dbl> 50.10, 48.10, 47.33, 42.27…
## $ price_actual <dbl> 65.41, 64.92, 64.48, 59.32…
The energy dataset contains 35,064 hourly records and 29 columns, which include:
- Timestamp (time)
- Generation by source (generation_biomass, …, generation_solar)
- Load (total_load_forecast, total_load_actual)
- Prices (price_day_ahead, price_actual)

The data appears to span from 2014-12-31 to 2018-12-31 with 1-hour intervals.
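The claim of a gap-free hourly series can be checked rather than read off the printout. The following is a minimal sketch on a toy vector standing in for the time column; the names gap_hours, is_regular, and expected_rows are illustrative and not part of the notebook.

```r
# Toy stand-in for energy$time: 48 consecutive hourly timestamps in UTC
time <- seq(as.POSIXct("2014-12-31 23:00:00", tz = "UTC"),
            by = "1 hour", length.out = 48)

# Gaps between consecutive timestamps, expressed in hours
gap_hours <- as.numeric(diff(time), units = "hours")

# TRUE when every step is exactly one hour (no gaps, no repeated timestamps)
is_regular <- all(gap_hours == 1)

# Row count that a gap-free hourly series over the same range would have
expected_rows <- as.numeric(difftime(max(time), min(time), units = "hours")) + 1
```

Applied to the real column, a range from 2014-12-31 23:00 to 2018-12-31 22:00 corresponds to 35,064 hourly steps, which is consistent with the row count reported by glimpse().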
# Display the structure of the weather dataset
glimpse(weather)
## Rows: 178,396
## Columns: 17
## $ dt_iso <dttm> 2014-12-31 23:00:00, 2015-01-01 00:00:00, 2015-01…
## $ city_name <chr> "Valencia", "Valencia", "Valencia", "Valencia", "V…
## $ temp <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_min <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_max <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ pressure <dbl> 1001, 1001, 1002, 1002, 1002, 1004, 1004, 1004, 10…
## $ humidity <dbl> 77, 77, 78, 78, 78, 71, 71, 71, 71, 71, 71, 55, 55…
## $ wind_speed <dbl> 1, 1, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ wind_deg <dbl> 62, 62, 23, 23, 23, 321, 321, 321, 307, 307, 307, …
## $ rain_1h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rain_3h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow_3h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds_all <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather_id <dbl> 800, 800, 800, 800, 800, 800, 800, 800, 800, 800, …
## $ weather_main <chr> "clear", "clear", "clear", "clear", "clear", "clea…
## $ weather_description <chr> "sky is clear", "sky is clear", "sky is clear", "s…
## $ weather_icon <chr> "01n", "01n", "01n", "01n", "01n", "01n", "01n", "…
The weather dataset has 178,396 rows and 17 columns, including:
- Timestamp (dt_iso)
- City (city_name)
- Weather measurements such as temp, humidity, wind_speed, pressure, and weather_description

This dataset also has a datetime range matching the energy dataset, which facilitates temporal merging.
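One row per city per hour would give 5 × 35,064 = 175,320 rows, which is 3,076 fewer than the 178,396 rows reported above. That gap suggests some city-hour pairs occur more than once, though this is an inference that should be verified on the real data. A minimal sketch of such a check on toy data (weather_toy and the derived names are illustrative only):

```r
library(dplyr)

# Toy stand-in for `weather`: two cities, two hours, one repeated record
weather_toy <- tibble(
  dt_iso = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 00:00:00",
                        "2015-01-01 01:00:00", "2015-01-01 01:00:00",
                        "2015-01-01 01:00:00"), tz = "UTC"),
  city_name = c("Valencia", "Madrid", "Valencia", "Madrid", "Madrid"),
  temp = c(270.5, 267.3, 269.7, 266.1, 266.1)
)

# Rows per city: equal counts are expected if every city has one row per hour
rows_per_city <- weather_toy %>% count(city_name)

# City-hour pairs that occur more than once
repeated_pairs <- weather_toy %>%
  count(dt_iso, city_name, name = "n_rows") %>%
  filter(n_rows > 1)

# Keep only the first record for each city-hour pair
weather_dedup <- weather_toy %>%
  distinct(dt_iso, city_name, .keep_all = TRUE)
```

Keeping the first record is only one possible policy; averaging the repeated rows would be an equally defensible choice.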
# Display summary statistics for the energy dataset
skim(energy)
| Name | energy |
| Number of rows | 35064 |
| Number of columns | 29 |
| _______________________ | |
| Column type frequency: | |
| logical | 2 |
| numeric | 26 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| generation_hydro_pumped_storage_aggregated | 35064 | 0 | NaN | : |
| forecast_wind_offshore_eday_ahead | 35064 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| generation_biomass | 19 | 1 | 383.51 | 85.35 | 0.00 | 333.00 | 367.00 | 433.00 | 592.00 | ▁▁▇▇▅ |
| generation_fossil_brown_coal_lignite | 18 | 1 | 448.06 | 354.57 | 0.00 | 0.00 | 509.00 | 757.00 | 999.00 | ▇▂▅▅▆ |
| generation_fossil_coal_derived_gas | 18 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_fossil_gas | 18 | 1 | 5622.74 | 2201.83 | 0.00 | 4126.00 | 4969.00 | 6429.00 | 20034.00 | ▂▇▂▁▁ |
| generation_fossil_hard_coal | 18 | 1 | 4256.07 | 1961.60 | 0.00 | 2527.00 | 4474.00 | 5838.75 | 8359.00 | ▃▆▆▇▃ |
| generation_fossil_oil | 19 | 1 | 298.32 | 52.52 | 0.00 | 263.00 | 300.00 | 330.00 | 449.00 | ▁▁▃▇▂ |
| generation_fossil_oil_shale | 18 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_fossil_peat | 18 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_geothermal | 18 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_hydro_pumped_storage_consumption | 19 | 1 | 475.58 | 792.41 | 0.00 | 0.00 | 68.00 | 616.00 | 4523.00 | ▇▁▁▁▁ |
| generation_hydro_run_of_river_and_poundage | 19 | 1 | 972.12 | 400.78 | 0.00 | 637.00 | 906.00 | 1250.00 | 2000.00 | ▁▇▆▃▂ |
| generation_hydro_water_reservoir | 18 | 1 | 2605.11 | 1835.20 | 0.00 | 1077.25 | 2164.00 | 3757.00 | 9728.00 | ▇▆▃▁▁ |
| generation_marine | 19 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_nuclear | 17 | 1 | 6263.91 | 839.67 | 0.00 | 5760.00 | 6566.00 | 7025.00 | 7117.00 | ▁▁▁▂▇ |
| generation_other | 18 | 1 | 60.23 | 20.24 | 0.00 | 53.00 | 57.00 | 80.00 | 106.00 | ▁▁▇▂▂ |
| generation_other_renewable | 18 | 1 | 85.64 | 14.08 | 0.00 | 73.00 | 88.00 | 97.00 | 119.00 | ▁▁▃▇▅ |
| generation_solar | 18 | 1 | 1432.67 | 1680.12 | 0.00 | 71.00 | 616.00 | 2578.00 | 5792.00 | ▇▂▁▁▁ |
| generation_waste | 19 | 1 | 269.45 | 50.20 | 0.00 | 240.00 | 279.00 | 310.00 | 357.00 | ▁▁▂▇▇ |
| generation_wind_offshore | 18 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| generation_wind_onshore | 18 | 1 | 5464.48 | 3213.69 | 0.00 | 2933.00 | 4849.00 | 7398.00 | 17436.00 | ▇▇▅▂▁ |
| forecast_solar_day_ahead | 0 | 1 | 1439.07 | 1677.70 | 0.00 | 69.00 | 576.00 | 2636.00 | 5836.00 | ▇▂▁▂▁ |
| forecast_wind_onshore_day_ahead | 0 | 1 | 5471.22 | 3176.31 | 237.00 | 2979.00 | 4855.00 | 7353.00 | 17430.00 | ▇▇▃▂▁ |
| total_load_forecast | 0 | 1 | 28712.13 | 4594.10 | 18105.00 | 24793.75 | 28906.00 | 32263.25 | 41390.00 | ▂▇▇▆▁ |
| total_load_actual | 36 | 1 | 28696.94 | 4574.99 | 18041.00 | 24807.75 | 28901.00 | 32192.00 | 41015.00 | ▂▇▇▆▁ |
| price_day_ahead | 0 | 1 | 49.87 | 14.62 | 2.06 | 41.49 | 50.52 | 60.53 | 101.99 | ▁▃▇▃▁ |
| price_actual | 0 | 1 | 57.88 | 14.20 | 9.33 | 49.35 | 58.02 | 68.01 | 116.80 | ▁▅▇▂▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time | 0 | 1 | 2014-12-31 23:00:00 | 2018-12-31 22:00:00 | 2016-12-31 10:30:00 | 35064 |
Most columns are numeric (dbl), with one datetime column (POSIXct) and two logical columns that represent completely missing data (NA only):

- generation_hydro_pumped_storage_aggregated
- forecast_wind_offshore_eday_ahead

These two columns are removed during cleaning because they have 100% missing values.
# Display summary statistics for the weather dataset
skim(weather)
| Name | weather |
| Number of rows | 178396 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 12 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city_name | 0 | 1 | 6 | 9 | 0 | 5 | 0 |
| weather_main | 0 | 1 | 3 | 12 | 0 | 12 | 0 |
| weather_description | 0 | 1 | 3 | 28 | 0 | 43 | 0 |
| weather_icon | 0 | 1 | 2 | 3 | 0 | 24 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| temp | 0 | 1 | 289.62 | 8.03 | 262.24 | 283.67 | 289.15 | 295.15 | 315.60 | ▁▅▇▅▁ |
| temp_min | 0 | 1 | 288.33 | 7.96 | 262.24 | 282.48 | 288.15 | 293.73 | 315.15 | ▁▅▇▃▁ |
| temp_max | 0 | 1 | 291.09 | 8.61 | 262.24 | 284.65 | 290.15 | 297.15 | 321.15 | ▁▅▇▃▁ |
| pressure | 0 | 1 | 1069.26 | 5969.63 | 0.00 | 1013.00 | 1018.00 | 1022.00 | 1008371.00 | ▇▁▁▁▁ |
| humidity | 0 | 1 | 68.42 | 21.90 | 0.00 | 53.00 | 72.00 | 87.00 | 100.00 | ▁▂▅▆▇ |
| wind_speed | 0 | 1 | 2.47 | 2.10 | 0.00 | 1.00 | 2.00 | 4.00 | 133.00 | ▇▁▁▁▁ |
| wind_deg | 0 | 1 | 166.59 | 116.61 | 0.00 | 55.00 | 177.00 | 270.00 | 360.00 | ▇▅▃▆▆ |
| rain_1h | 0 | 1 | 0.08 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 12.00 | ▇▁▁▁▁ |
| rain_3h | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 2.32 | ▇▁▁▁▁ |
| snow_3h | 0 | 1 | 0.00 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 21.50 | ▇▁▁▁▁ |
| clouds_all | 0 | 1 | 25.07 | 30.77 | 0.00 | 0.00 | 20.00 | 40.00 | 100.00 | ▇▁▁▂▁ |
| weather_id | 0 | 1 | 759.83 | 108.73 | 200.00 | 800.00 | 800.00 | 801.00 | 804.00 | ▁▁▁▁▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| dt_iso | 0 | 1 | 2014-12-31 23:00:00 | 2018-12-31 22:00:00 | 2017-01-05 05:00:00 | 35064 |
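- The weather dataset has no missing values in any column.
- Numeric measurements (temp, pressure, wind_speed, etc.)
- Categorical descriptors (weather_main, weather_description)
- Weather codes (weather_id, weather_icon)

This dataset is well-suited for modeling, with consistent sampling and granularity, although the maxima of pressure (1,008,371) and wind_speed (133) look like outliers that should be checked first.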
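The summary statistics hint at two practical points: temperatures around 289 appear to be in kelvin, and the extreme pressure and wind_speed values are hard to reconcile with physical measurements. The sketch below converts temperature to Celsius and flags suspicious readings. The thresholds (900 to 1100 for pressure, assumed to be hPa, and 50 for wind speed, assumed to be m/s) are assumptions chosen for illustration, not values taken from the dataset documentation.

```r
library(dplyr)

# Toy rows using values that appear in the summary above
weather_toy <- tibble(
  temp = c(270.475, 289.15, 315.6),
  pressure = c(1001, 1008371, 0),
  wind_speed = c(1, 133, 2)
)

weather_checked <- weather_toy %>%
  mutate(
    # Kelvin to Celsius (assumes temp is reported in kelvin)
    temp_c = temp - 273.15,
    # Assumed plausible sea-level pressure range in hPa
    pressure_suspect = pressure < 900 | pressure > 1100,
    # Assumed upper bound for a plausible wind speed in m/s
    wind_suspect = wind_speed > 50
  )
```

Flagging rather than deleting keeps the decision about how to treat these rows (drop, cap, or impute) explicit and reversible.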
Finally, we treat our datasets based on our observations:
# Summarize missing values per variable in both datasets to assess data quality
# NA summary - Energy
energy %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
mutate(
pct_missing = round(100 * n_missing / nrow(energy), 2)
) %>%
arrange(desc(pct_missing))
## # A tibble: 29 × 3
## variable n_missing pct_missing
## <chr> <int> <dbl>
## 1 generation_hydro_pumped_storage_aggregated 35064 100
## 2 forecast_wind_offshore_eday_ahead 35064 100
## 3 total_load_actual 36 0.1
## 4 generation_biomass 19 0.05
## 5 generation_fossil_brown_coal_lignite 18 0.05
## 6 generation_fossil_coal_derived_gas 18 0.05
## 7 generation_fossil_gas 18 0.05
## 8 generation_fossil_hard_coal 18 0.05
## 9 generation_fossil_oil 19 0.05
## 10 generation_fossil_oil_shale 18 0.05
## # ℹ 19 more rows
# NA summary - Weather
weather %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
mutate(
pct_missing = round(100 * n_missing / nrow(weather), 2)
) %>%
arrange(desc(pct_missing))
## # A tibble: 17 × 3
## variable n_missing pct_missing
## <chr> <int> <dbl>
## 1 dt_iso 0 0
## 2 city_name 0 0
## 3 temp 0 0
## 4 temp_min 0 0
## 5 temp_max 0 0
## 6 pressure 0 0
## 7 humidity 0 0
## 8 wind_speed 0 0
## 9 wind_deg 0 0
## 10 rain_1h 0 0
## 11 rain_3h 0 0
## 12 snow_3h 0 0
## 13 clouds_all 0 0
## 14 weather_id 0 0
## 15 weather_main 0 0
## 16 weather_description 0 0
## 17 weather_icon 0 0
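Columns that are entirely missing, like the two at the top of the energy summary, can also be dropped without naming them. This is a minimal sketch on a toy tibble using tidyselect's where(); energy_toy and all_missing are illustrative names only.

```r
library(dplyr)

# Toy stand-in for `energy` with one column that is 100% NA
energy_toy <- tibble(
  time = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00",
                      "2015-01-01 02:00:00"), tz = "UTC"),
  generation_solar = c(49, 50, NA),
  all_missing = c(NA, NA, NA)
)

# Keep every column that has at least one non-missing value
energy_trimmed <- energy_toy %>%
  select(where(~ !all(is.na(.x))))
```

Naming the columns explicitly documents intent more clearly, while the predicate form keeps working if the set of empty columns changes in a later data refresh.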
# Remove columns that contain 100% missing values and provide no useful information
energy <- energy %>%
select(-any_of(c("generation_hydro_pumped_storage_aggregated",
"forecast_wind_offshore_eday_ahead")))
We also rename the timestamp column in each dataset to datetime for consistency.
# Rename time columns to 'datetime' for consistency
energy <- energy %>%
rename(datetime = time)
weather <- weather %>%
rename(datetime = dt_iso)
And that’s it! We are ready to merge the datasets.
Once we have cleaned the datasets, we can merge them based on the
datetime column. This will allow us to analyze the
relationship between energy consumption and
weather conditions.
combined <- left_join(energy, weather, by = "datetime")
We use left_join() to ensure that all records from the
energy dataset are retained, even if there are no corresponding weather
records.
Note: Weather data includes observations from five major Spanish cities, while the energy dataset is aggregated at the national level. Because the join key is datetime alone, each hourly energy record is repeated once for every matching weather row (roughly five times, one per city), and no spatial distinction is made in this EDA. All weather records are interpreted as representative of national conditions. In a future step, we could aggregate weather by hour (averaging across cities) or isolate individual cities if needed.
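The hourly aggregation mentioned in the note could look roughly like the sketch below, which averages weather across cities before joining so that the result has one row per hour. It uses toy data, and the unweighted mean across cities is an assumption; weighting by population or demand share would be another option.

```r
library(dplyr)

hours <- as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00"), tz = "UTC")

# Toy national energy data: one row per hour
energy_toy <- tibble(
  datetime = hours,
  total_load_actual = c(25385, 24382)
)

# Toy weather data: one row per hour and city
weather_toy <- tibble(
  datetime = rep(hours, each = 2),
  city_name = rep(c("Valencia", "Madrid"), times = 2),
  temp = c(270, 268, 273, 267),
  humidity = c(77, 63, 78, 64)
)

# Collapse to one weather row per hour (unweighted mean across cities)
weather_hourly <- weather_toy %>%
  group_by(datetime) %>%
  summarise(across(c(temp, humidity), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

# One-to-one join: the row count matches the energy data
combined_hourly <- left_join(energy_toy, weather_hourly, by = "datetime")
```

With this shape, time-series plots draw a single line per variable and summary statistics are not affected by hours that have more weather rows than others.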
In this section, we visualize major variables over time to understand
trends, patterns, and potential anomalies. All plots are based on the
combined dataset, which merges weather and
energy information.
This plot shows the forecasted vs actual energy load over time. While the general trends match, we can spot some deviations, which may indicate model drift, unexpected events, or forecasting bias.
combined %>%
select(datetime, total_load_forecast, total_load_actual) %>%
pivot_longer(-datetime) %>%
ggplot(aes(x = datetime, y = value, color = name)) +
geom_line(alpha = 0.6) +
labs(title = "Energy Load: Forecast vs Actual", y = "Load (MW)", x = "Time") +
theme_minimal()
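Deviations between forecast and actual load are easier to compare when summarized numerically. The following sketch computes bias, mean absolute error, and mean absolute percentage error on a few toy rows; the metric names and the choice of these three summaries are illustrative, not part of the original analysis.

```r
library(dplyr)

# Toy rows; the last actual value is missing on purpose
load_toy <- tibble(
  total_load_forecast = c(26118, 24934, 23515, 22642),
  total_load_actual = c(25385, 24382, 22734, NA)
)

load_errors <- load_toy %>%
  filter(!is.na(total_load_actual)) %>%                 # skip hours without an actual value
  mutate(error = total_load_forecast - total_load_actual) %>%
  summarise(
    bias = mean(error),                                  # positive means over-forecasting
    mae = mean(abs(error)),                              # average size of the miss, in MW
    mape = 100 * mean(abs(error) / total_load_actual)    # average miss as % of actual load
  )
```

A bias close to zero with a noticeably larger MAE would indicate errors that cancel out on average, whereas a bias close to the MAE points to a systematic over- or under-forecast.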
We track the evolution of temperature, humidity, and wind speed. The periodic structure (particularly in temperature) suggests strong seasonality. Humidity and wind speed show more variability, potentially correlating with energy production in certain sources.
combined %>%
select(datetime, temp, wind_speed, humidity) %>%
pivot_longer(-datetime) %>%
ggplot(aes(x = datetime, y = value, color = name)) +
geom_line(alpha = 0.5) +
facet_wrap(~name, scales = "free_y") +
labs(title = "Weather Variables Over Time", y = "Value", x = "Time") +
theme_minimal()
This chart compares the day-ahead market price with the actual price. Large discrepancies may highlight forecast errors or market volatility. These trends are essential for assessing model accuracy and energy strategy.
combined %>%
select(datetime, price_day_ahead, price_actual) %>%
pivot_longer(-datetime) %>%
ggplot(aes(x = datetime, y = value, color = name)) +
geom_line(alpha = 0.6) +
labs(title = "Electricity Prices: Day Ahead vs Actual", y = "€/MWh", x = "Time") +
theme_minimal()
To better understand short-term fluctuations, we zoom into the first week of 2015. We observe clear daily cycles, indicating high periodicity and potential for time-series decomposition.
combined %>%
filter(datetime >= as.POSIXct("2015-01-01", tz = "UTC"),
datetime < as.POSIXct("2015-01-08", tz = "UTC")) %>%
ggplot(aes(x = datetime, y = total_load_actual)) +
geom_line() +
labs(title = "Energy Load - Sample Week", y = "MW", x = "Date") +
theme_minimal()
This section analyzes temporal patterns in energy demand by extracting time-based features and visualizing their influence on consumption.
# Extract hour, weekday and month from datetime
combined <- combined %>%
mutate(
hour = hour(datetime),
wday = wday(datetime, label = TRUE, abbr = TRUE), # e.g., Mon, Tue
month = month(datetime, label = TRUE, abbr = TRUE) # e.g., Jan, Feb
)
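The timestamps printed earlier start at 2014-12-31 23:00:00, which suggests they are stored in UTC, so hour(datetime) returns the UTC hour rather than the local Spanish hour. If local clock time is what matters for demand patterns, the timestamps can be converted first. This is a minimal sketch with lubridate's with_tz(); using "Europe/Madrid" as the target zone is an assumption about the intended local time.

```r
library(lubridate)

# Two UTC timestamps at the same clock hour, one in winter and one in summer
utc_time <- ymd_hms(c("2015-01-01 07:00:00", "2015-07-01 07:00:00"), tz = "UTC")

# Same instants expressed in Spanish local time (CET in winter, CEST in summer)
local_time <- with_tz(utc_time, tzone = "Europe/Madrid")

hour(utc_time)    # 7 7
hour(local_time)  # 8 9
```

Because the offset changes with daylight saving time, hourly averages computed on UTC hours blend two local hours over the year, which can slightly smear the morning and evening peaks.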
# Average actual load by hour of the day
combined %>%
group_by(hour) %>%
summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
ggplot(aes(x = hour, y = avg_load)) +
geom_line(color = "steelblue", linewidth = 1) +
labs(
title = "Average Energy Load by Hour",
x = "Hour of Day", y = "Average Load (MW)"
) +
theme_minimal()
💡 Insight: This plot reveals the daily cycle of energy demand, typically with peaks in the morning and evening, and a drop at night.
# Distribution of actual load by weekday
combined %>%
mutate(wday = factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
ggplot(aes(x = wday, y = total_load_actual)) +
geom_boxplot(fill = "skyblue", alpha = 0.6, outlier.color = "gray") +
labs(
title = "Energy Load Distribution by Weekday",
x = "Day of Week", y = "Load (MW)"
) +
theme_minimal()
💡 Insight: Boxplots show variability between weekdays and weekends. Useful to detect behavioral or industrial consumption patterns.
# Average load per month across all years
combined %>%
group_by(month) %>%
summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
ggplot(aes(x = month, y = avg_load, group = 1)) +
geom_line(color = "darkgreen", linewidth = 1) +
geom_point(color = "darkgreen", size = 2) +
labs(
title = "Monthly Average Energy Load",
x = "Month", y = "Average Load (MW)"
) +
theme_minimal()
💡 Insight: This plot helps identify seasonal demand patterns. For example, summer or winter peaks could guide forecasting strategies.
In this section, we explore the relationships between numeric features, particularly between weather conditions and energy load.
Correlation analysis helps us understand the relationships between numeric features in the dataset. This is useful for:
In this notebook, we use Pearson correlation, which
measures linear relationships. We also apply drop_na() to
ensure the correlations are calculated on complete data only.
# Select only numeric columns and compute Pearson correlation
correlation_matrix <- combined %>%
select(where(is.numeric)) %>%
drop_na() %>%
cor(method = "pearson")
# Plot correlation matrix using ggcorrplot
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 2.5,
colors = c("#6D9EC1", "white", "#E46767"),
title = "Correlation Matrix (Numerical Features)",
ggtheme = theme_minimal())
💡 Note: This global correlation matrix provides a technical overview of all numeric features. While useful to detect potential redundancies or collinearities, its density makes it hard to extract actionable insights.
We defer focused interpretation to the next section.
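Several generation columns in the summary statistics have a standard deviation of zero (for example generation_marine), and cor() returns NA for any pair involving a constant column. A minimal sketch of filtering such columns out before computing the matrix, on toy data with illustrative names:

```r
library(dplyr)

# Toy data: one constant column alongside two varying ones
toy <- tibble(
  total_load_actual = c(25385, 24382, 22734, 21286),
  generation_solar = c(49, 50, 50, 42),
  generation_marine = c(0, 0, 0, 0)
)

# Keep numeric columns whose standard deviation is strictly positive
cor_ready <- toy %>%
  select(where(is.numeric)) %>%
  select(where(~ isTRUE(sd(.x, na.rm = TRUE) > 0)))

cor_matrix <- cor(cor_ready, method = "pearson")
```

Removing constant columns also makes the plotted matrix smaller, which helps with the density problem noted above.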
To better understand how energy consumption relates to weather, we isolated key variables and computed their pairwise Pearson correlations.
# Focus only on energy load and weather variables
subset_corr <- combined %>%
select(total_load_actual, total_load_forecast, temp, humidity, wind_speed, pressure) %>%
drop_na() %>%
cor()
ggcorrplot(subset_corr, lab = TRUE, type = "lower", ggtheme = theme_minimal())
- total_load_actual and total_load_forecast are very strongly correlated (≈ 1), which confirms the forecast’s high alignment with actual demand.
- temp and humidity exhibit a strong negative correlation (-0.574), reflecting the physical inverse relationship between air temperature and relative humidity.
- The correlation between energy demand (total_load_actual) and weather conditions is weak to moderate.
- Pressure is effectively uncorrelated with both energy demand and other variables (all correlations ≈ 0).

These results suggest that while weather contributes to variations in energy usage, its linear influence is limited; the relationship may be non-linear or confounded by other factors such as time of day or city.
This exploratory analysis has revealed several key patterns in the energy demand dataset:
- The strong correlation between total_load_actual and total_load_forecast confirms the reliability of the forecasts provided.

This EDA serves as a foundation to inform feature selection and modeling decisions in the next phase of this project.
As a final step, we export the combined dataset for future use.
# Create 'processed' folder if it doesn't exist
if (!dir.exists("../data/processed")) {
dir.create("../data/processed", recursive = TRUE)
}
# Export combined cleaned dataset for use in Python
write_csv(combined, "../data/processed/combined_clean.csv")