EDA - Energy Consumption and Weather Data

This notebook performs an exploratory data analysis (EDA) on the hourly energy demand, generation, pricing, and weather dataset.

This dataset contains four years of hourly data on electricity consumption, generation, prices, and weather conditions in Spain. Energy data was retrieved from ENTSO-E (a public portal for TSO data), pricing data from Red Eléctrica de España (the Spanish TSO), and weather data (for the five largest cities in Spain) was originally obtained via the OpenWeather API and made public by the dataset author.

The main goals of this EDA are:

To understand the structure and quality of the datasets.
To detect missing or inconsistent values.
To identify temporal patterns in energy usage.
To prepare the data for feature engineering and modeling.

📦 Load libraries

# Load necessary packages for data wrangling and visualization
library(tidyverse)    # Core data manipulation tools
library(lubridate)    # Date-time parsing and extraction
library(skimr)        # Data summary and structure
library(janitor)      # Clean column names
library(ggplot2)      # Data visualization
library(ggthemes)     # Visualization themes
library(viridis)      # Color palettes
library(ggcorrplot)   # Correlation matrix visualization

📁 Load data

# Read raw CSV files (energy and weather) from 'data/raw' relative to notebook location
energy <- read_csv("../data/raw/energy_dataset.csv")
weather <- read_csv("../data/raw/weather_features.csv")

🧹 Clean column names

# Convert column names to snake_case format
energy <- energy %>% clean_names()
weather <- weather %>% clean_names()

🧹 1. Data Overview & Structure

To begin the analysis, we perform a quick structural inspection using glimpse() and a detailed statistical summary with skim() for both datasets: energy and weather.

🔍 1.1 Glimpse: See Columns and Data Types

🔍 1.1.1 Glimpse: Energy Dataset

# Display the structure of the energy dataset
glimpse(energy)

## Rows: 35,064
## Columns: 29
## $ time                                        <dttm> 2014-12-31 23:00:00, 2015…
## $ generation_biomass                          <dbl> 447, 449, 448, 438, 428, 4…
## $ generation_fossil_brown_coal_lignite        <dbl> 329, 328, 323, 254, 187, 1…
## $ generation_fossil_coal_derived_gas          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_gas                       <dbl> 4844, 5196, 4857, 4314, 41…
## $ generation_fossil_hard_coal                 <dbl> 4821, 4755, 4581, 4131, 38…
## $ generation_fossil_oil                       <dbl> 162, 158, 157, 160, 156, 1…
## $ generation_fossil_oil_shale                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_peat                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_geothermal                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_hydro_pumped_storage_aggregated  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ generation_hydro_pumped_storage_consumption <dbl> 863, 920, 1164, 1503, 1826…
## $ generation_hydro_run_of_river_and_poundage  <dbl> 1051, 1009, 973, 949, 953,…
## $ generation_hydro_water_reservoir            <dbl> 1899, 1658, 1371, 779, 720…
## $ generation_marine                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_nuclear                          <dbl> 7096, 7096, 7099, 7098, 70…
## $ generation_other                            <dbl> 43, 43, 43, 43, 43, 43, 43…
## $ generation_other_renewable                  <dbl> 73, 71, 73, 75, 74, 74, 74…
## $ generation_solar                            <dbl> 49, 50, 50, 50, 42, 34, 34…
## $ generation_waste                            <dbl> 196, 195, 196, 191, 189, 1…
## $ generation_wind_offshore                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_wind_onshore                     <dbl> 6378, 5890, 5461, 5238, 49…
## $ forecast_solar_day_ahead                    <dbl> 17, 16, 8, 2, 9, 4, 3, 12,…
## $ forecast_wind_offshore_eday_ahead           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ forecast_wind_onshore_day_ahead             <dbl> 6436, 5856, 5454, 5151, 48…
## $ total_load_forecast                         <dbl> 26118, 24934, 23515, 22642…
## $ total_load_actual                           <dbl> 25385, 24382, 22734, 21286…
## $ price_day_ahead                             <dbl> 50.10, 48.10, 47.33, 42.27…
## $ price_actual                                <dbl> 65.41, 64.92, 64.48, 59.32…

The energy dataset contains 35,064 hourly records and 29 columns, which include:

Timestamps (time)
Energy generation by source (e.g., generation_biomass, generation_solar)
Forecast and actual values for energy consumption (total_load_forecast, total_load_actual)
Pricing variables (price_day_ahead, price_actual)

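The data spans 2014-12-31 23:00 to 2018-12-31 22:00 at 1-hour intervals. The 35,064 rows match four full years of hourly records, including the 2016 leap day (4 × 8,760 + 24 = 35,064).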
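To confirm the 1-hour spacing rather than infer it from the printed values, one option is to compute the step between consecutive timestamps and list any row where it differs from one hour. The sketch below is a minimal illustration on toy data; `find_time_gaps()` is a hypothetical helper, not a function used elsewhere in this notebook.

```r
library(dplyr)

# Hypothetical helper (not part of the original notebook): return the rows whose
# time step from the previous row is not exactly one hour.
find_time_gaps <- function(df, time_col = "time") {
  df %>%
    arrange(.data[[time_col]]) %>%
    mutate(
      step_hours = as.numeric(
        difftime(.data[[time_col]], lag(.data[[time_col]]), units = "hours")
      )
    ) %>%
    filter(!is.na(step_hours), step_hours != 1)
}

# Toy data for illustration only: the 02:00 record is missing
toy <- tibble(
  time = as.POSIXct(
    c("2015-01-01 00:00:00", "2015-01-01 01:00:00",
      "2015-01-01 03:00:00", "2015-01-01 04:00:00"),
    tz = "UTC"
  )
)

gaps <- find_time_gaps(toy)
print(gaps)  # one row: the 03:00 stamp, with step_hours = 2
```

Applied to the notebook data, `find_time_gaps(energy)` should return zero rows if the series is truly continuous.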

🔍 1.1.2 Glimpse: Weather Dataset

# Display the structure of the weather dataset
glimpse(weather)

## Rows: 178,396
## Columns: 17
## $ dt_iso              <dttm> 2014-12-31 23:00:00, 2015-01-01 00:00:00, 2015-01…
## $ city_name           <chr> "Valencia", "Valencia", "Valencia", "Valencia", "V…
## $ temp                <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_min            <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_max            <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ pressure            <dbl> 1001, 1001, 1002, 1002, 1002, 1004, 1004, 1004, 10…
## $ humidity            <dbl> 77, 77, 78, 78, 78, 71, 71, 71, 71, 71, 71, 55, 55…
## $ wind_speed          <dbl> 1, 1, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ wind_deg            <dbl> 62, 62, 23, 23, 23, 321, 321, 321, 307, 307, 307, …
## $ rain_1h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rain_3h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow_3h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds_all          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather_id          <dbl> 800, 800, 800, 800, 800, 800, 800, 800, 800, 800, …
## $ weather_main        <chr> "clear", "clear", "clear", "clear", "clear", "clea…
## $ weather_description <chr> "sky is clear", "sky is clear", "sky is clear", "s…
## $ weather_icon        <chr> "01n", "01n", "01n", "01n", "01n", "01n", "01n", "…

The weather dataset has 178,396 rows and 17 columns, including:

Timestamps (dt_iso)
Weather observations for major Spanish cities (city_name)
Variables such as temp, humidity, wind_speed, pressure, and weather_description

This dataset covers the same datetime range as the energy dataset, which facilitates temporal merging. Note that it holds one row per city per hour, so each timestamp appears about five times.
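The row count deserves a quick sanity check: five cities × 35,064 hours would give 175,320 rows, but the weather dataset has 178,396, so some city-hour combinations must appear more than once. A minimal sketch for locating them follows; `count_duplicate_city_hours()` is a hypothetical helper and the data is a toy example.

```r
library(dplyr)

# Hypothetical helper: list (timestamp, city) pairs that occur more than once
count_duplicate_city_hours <- function(weather_df) {
  weather_df %>%
    count(dt_iso, city_name, name = "n_rows") %>%
    filter(n_rows > 1)
}

# Toy data for illustration only: Madrid is recorded twice for the same hour
toy_weather <- tibble(
  dt_iso = as.POSIXct(rep("2015-01-01 00:00:00", 3), tz = "UTC"),
  city_name = c("Madrid", "Madrid", "Valencia"),
  temp = c(270.5, 270.5, 271.0)
)

dups <- count_duplicate_city_hours(toy_weather)
print(dups)  # one row: Madrid at 00:00 with n_rows = 2
```

If duplicates show up in the real data, `distinct(dt_iso, city_name, .keep_all = TRUE)` is one simple way to keep the first record per city-hour, although which record to keep is a judgment call.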

📊 1.2 Skim: Summary Statistics

⚡ 1.2.1 Skim: Energy Dataset

# Display summary statistics for the energy dataset
skim(energy)

Data summary
Name	energy
Number of rows	35064
Number of columns	29
_______________________
Column type frequency:
logical	2
numeric	26
POSIXct	1
________________________
Group variables	None

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
generation_hydro_pumped_storage_aggregated	35064	0	NaN	:
forecast_wind_offshore_eday_ahead	35064	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
generation_biomass	19	1	383.51	85.35	0.00	333.00	367.00	433.00	592.00	▁▁▇▇▅
generation_fossil_brown_coal_lignite	18	1	448.06	354.57	0.00	0.00	509.00	757.00	999.00	▇▂▅▅▆
generation_fossil_coal_derived_gas	18	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_fossil_gas	18	1	5622.74	2201.83	0.00	4126.00	4969.00	6429.00	20034.00	▂▇▂▁▁
generation_fossil_hard_coal	18	1	4256.07	1961.60	0.00	2527.00	4474.00	5838.75	8359.00	▃▆▆▇▃
generation_fossil_oil	19	1	298.32	52.52	0.00	263.00	300.00	330.00	449.00	▁▁▃▇▂
generation_fossil_oil_shale	18	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_fossil_peat	18	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_geothermal	18	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_hydro_pumped_storage_consumption	19	1	475.58	792.41	0.00	0.00	68.00	616.00	4523.00	▇▁▁▁▁
generation_hydro_run_of_river_and_poundage	19	1	972.12	400.78	0.00	637.00	906.00	1250.00	2000.00	▁▇▆▃▂
generation_hydro_water_reservoir	18	1	2605.11	1835.20	0.00	1077.25	2164.00	3757.00	9728.00	▇▆▃▁▁
generation_marine	19	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_nuclear	17	1	6263.91	839.67	0.00	5760.00	6566.00	7025.00	7117.00	▁▁▁▂▇
generation_other	18	1	60.23	20.24	0.00	53.00	57.00	80.00	106.00	▁▁▇▂▂
generation_other_renewable	18	1	85.64	14.08	0.00	73.00	88.00	97.00	119.00	▁▁▃▇▅
generation_solar	18	1	1432.67	1680.12	0.00	71.00	616.00	2578.00	5792.00	▇▂▁▁▁
generation_waste	19	1	269.45	50.20	0.00	240.00	279.00	310.00	357.00	▁▁▂▇▇
generation_wind_offshore	18	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
generation_wind_onshore	18	1	5464.48	3213.69	0.00	2933.00	4849.00	7398.00	17436.00	▇▇▅▂▁
forecast_solar_day_ahead	0	1	1439.07	1677.70	0.00	69.00	576.00	2636.00	5836.00	▇▂▁▂▁
forecast_wind_onshore_day_ahead	0	1	5471.22	3176.31	237.00	2979.00	4855.00	7353.00	17430.00	▇▇▃▂▁
total_load_forecast	0	1	28712.13	4594.10	18105.00	24793.75	28906.00	32263.25	41390.00	▂▇▇▆▁
total_load_actual	36	1	28696.94	4574.99	18041.00	24807.75	28901.00	32192.00	41015.00	▂▇▇▆▁
price_day_ahead	0	1	49.87	14.62	2.06	41.49	50.52	60.53	101.99	▁▃▇▃▁
price_actual	0	1	57.88	14.20	9.33	49.35	58.02	68.01	116.80	▁▅▇▂▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
time	0	1	2014-12-31 23:00:00	2018-12-31 22:00:00	2016-12-31 10:30:00	35064

Most columns are numeric (dbl), with one datetime column (POSIXct) and two logical columns that contain only NA.
Missing data is very low (≤ 0.1%) across all variables except two, which are 100% missing:
- generation_hydro_pumped_storage_aggregated
- forecast_wind_offshore_eday_ahead
These two columns are removed later in the notebook.
Some generation sources (coal-derived gas, oil shale, peat, geothermal, marine, and offshore wind) are recorded as zeros or missing throughout the entire time series. This might be expected depending on Spain’s energy infrastructure.
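Columns that never vary carry no information for modeling and cause trouble later on (for example, their correlation with anything else is undefined). One way to list them programmatically is sketched below; `find_constant_columns()` is a hypothetical helper and the tibble is toy data that reuses a few column names from the energy dataset.

```r
library(dplyr)

# Hypothetical helper: names of numeric columns with at most one distinct
# non-missing value
find_constant_columns <- function(df) {
  numeric_cols <- select(df, where(is.numeric))
  n_values <- vapply(
    numeric_cols,
    function(x) length(unique(x[!is.na(x)])),
    integer(1)
  )
  names(n_values)[n_values <= 1]
}

# Toy data for illustration only
toy_energy <- tibble(
  generation_solar = c(49, 50, NA, 42),
  generation_marine = c(0, 0, NA, 0),
  generation_geothermal = c(0, 0, 0, 0)
)

constant_cols <- find_constant_columns(toy_energy)
print(constant_cols)  # "generation_marine" "generation_geothermal"
```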

🌦️ 1.2.2 Skim: Weather Dataset

# Display summary statistics for the weather dataset
skim(weather)

Data summary
Name	weather
Number of rows	178396
Number of columns	17
_______________________
Column type frequency:
character	4
numeric	12
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
city_name	1	6	9	5
weather_main	1	3	12	12
weather_description	1	3	28	43
weather_icon	1	2	3	24

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
temp	1	289.62	8.03	262.24	283.67	289.15	295.15	315.60	▁▅▇▅▁
temp_min	1	288.33	7.96	262.24	282.48	288.15	293.73	315.15	▁▅▇▃▁
temp_max	1	291.09	8.61	262.24	284.65	290.15	297.15	321.15	▁▅▇▃▁
pressure	1	1069.26	5969.63	0.00	1013.00	1018.00	1022.00	1008371.00	▇▁▁▁▁
humidity	1	68.42	21.90	0.00	53.00	72.00	87.00	100.00	▁▂▅▆▇
wind_speed	1	2.47	2.10	0.00	1.00	2.00	4.00	133.00	▇▁▁▁▁
wind_deg	1	166.59	116.61	0.00	55.00	177.00	270.00	360.00	▇▅▃▆▆
rain_1h	1	0.08	0.40	0.00	0.00	0.00	0.00	12.00	▇▁▁▁▁
rain_3h	1	0.00	0.01	0.00	0.00	0.00	0.00	2.32	▇▁▁▁▁
snow_3h	1	0.00	0.22	0.00	0.00	0.00	0.00	21.50	▇▁▁▁▁
clouds_all	1	25.07	30.77	0.00	0.00	20.00	40.00	100.00	▇▁▁▂▁
weather_id	1	759.83	108.73	200.00	800.00	800.00	801.00	804.00	▁▁▁▁▇

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
dt_iso	0	1	2014-12-31 23:00:00	2018-12-31 22:00:00	2017-01-05 05:00:00	35064

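The weather data has no missing values in any column, but a few numeric ranges look implausible: pressure spans 0 to 1,008,371 and wind_speed reaches 133, which points to outliers or recording errors.
It includes:
- Numeric weather conditions (temp, pressure, wind_speed, etc.)
- Categorical weather types (weather_main, weather_description)
- Identifiers and icons from the API (weather_id, weather_icon)

Once those outliers are handled, the dataset is well-suited for modeling: it is sampled hourly, at the same granularity as the energy data.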
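The skim() output above shows a pressure maximum of 1,008,371 and a wind_speed maximum of 133, which are hard to reconcile with real conditions. A minimal sketch for masking such values follows. The thresholds (900 to 1100 for pressure, assumed to be in hPa, and 50 for wind speed, assumed to be in m/s) are illustrative assumptions, and `flag_implausible_weather()` is a hypothetical helper.

```r
library(dplyr)

# Hypothetical helper: replace values outside an assumed plausible range with NA.
# The default thresholds are assumptions, not values taken from the dataset docs.
flag_implausible_weather <- function(weather_df,
                                     pressure_range = c(900, 1100),
                                     max_wind_speed = 50) {
  weather_df %>%
    mutate(
      pressure = if_else(
        between(pressure, pressure_range[1], pressure_range[2]),
        pressure, NA_real_
      ),
      wind_speed = if_else(wind_speed <= max_wind_speed, wind_speed, NA_real_)
    )
}

# Toy data for illustration only, using the extremes reported by skim()
toy_weather <- tibble(
  pressure = c(1001, 0, 1008371, 1018),
  wind_speed = c(1, 2, 133, 4)
)

cleaned <- flag_implausible_weather(toy_weather)
print(cleaned)
```

Masking with NA keeps the row count intact, so the hourly grid is preserved and the gaps can be interpolated or ignored later.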

📌 1.3 Conclusion of Initial Inspection

Both datasets are well-structured and nearly complete.
Only two energy columns need to be removed due to full missingness.
The time coverage and granularity are compatible, which makes the datasets straightforward to merge and use for feature engineering.
From this point forward, we can derive temporal features, visualize trends, and assess correlations.

Finally, we treat our datasets based on our observations.

First, we double-check for missing values:

# Summarize missing values per variable in both datasets to assess data quality
# NA summary - Energy
energy %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(
    pct_missing = round(100 * n_missing / nrow(energy), 2)
  ) %>%
  arrange(desc(pct_missing))

## # A tibble: 29 × 3
##    variable                                   n_missing pct_missing
##    <chr>                                          <int>       <dbl>
##  1 generation_hydro_pumped_storage_aggregated     35064      100   
##  2 forecast_wind_offshore_eday_ahead              35064      100   
##  3 total_load_actual                                 36        0.1 
##  4 generation_biomass                                19        0.05
##  5 generation_fossil_brown_coal_lignite              18        0.05
##  6 generation_fossil_coal_derived_gas                18        0.05
##  7 generation_fossil_gas                             18        0.05
##  8 generation_fossil_hard_coal                       18        0.05
##  9 generation_fossil_oil                             19        0.05
## 10 generation_fossil_oil_shale                       18        0.05
## # ℹ 19 more rows

# NA summary - Weather
weather %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(
    pct_missing = round(100 * n_missing / nrow(weather), 2)
  ) %>%
  arrange(desc(pct_missing))

## # A tibble: 17 × 3
##    variable            n_missing pct_missing
##    <chr>                   <int>       <dbl>
##  1 dt_iso                      0           0
##  2 city_name                   0           0
##  3 temp                        0           0
##  4 temp_min                    0           0
##  5 temp_max                    0           0
##  6 pressure                    0           0
##  7 humidity                    0           0
##  8 wind_speed                  0           0
##  9 wind_deg                    0           0
## 10 rain_1h                     0           0
## 11 rain_3h                     0           0
## 12 snow_3h                     0           0
## 13 clouds_all                  0           0
## 14 weather_id                  0           0
## 15 weather_main                0           0
## 16 weather_description         0           0
## 17 weather_icon                0           0

Next, we remove the columns with 100% missingness:

# Remove columns that contain 100% missing values and provide no useful information
energy <- energy %>%
  select(-any_of(c("generation_hydro_pumped_storage_aggregated",
                   "forecast_wind_offshore_eday_ahead")))

Then we rename the time columns to datetime for consistency:

# Rename time columns to 'datetime' for consistency
energy <- energy %>%
  rename(datetime = time)
weather <- weather %>%
  rename(datetime = dt_iso)

And that’s it! We are ready to merge the datasets.

🔍 2. Merging Datasets

Once we have cleaned the datasets, we can merge them based on the datetime column. This will allow us to analyze the relationship between energy consumption and weather conditions.

combined <- left_join(energy, weather, by = "datetime")

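We use left_join() to ensure that all records from the energy dataset are retained, even if there are no corresponding weather records. Because the weather dataset has one row per city per hour, each energy record is repeated once for every matching weather row, so combined has roughly five rows per timestamp.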

Note: Weather data includes observations from five major Spanish cities. Since the energy dataset is aggregated at the national level, no spatial distinction is made in this EDA. All weather records are merged and interpreted as representative of national conditions. In a future step, we could aggregate weather by hour (averaging across cities) or isolate individual cities if needed.
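The hourly aggregation mentioned in the note could look like the sketch below, which averages a few numeric weather variables across cities before joining, so that the result has exactly one row per hour. The unweighted mean across cities is an assumption (a population-weighted mean would be another option), and `aggregate_weather_hourly()` is a hypothetical helper shown on toy data.

```r
library(dplyr)

# Hypothetical helper: average selected weather variables across cities per hour
aggregate_weather_hourly <- function(weather_df) {
  weather_df %>%
    group_by(datetime) %>%
    summarise(
      across(c(temp, humidity, wind_speed, pressure), ~ mean(.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

# Toy data for illustration only: two cities, two hours
toy_weather <- tibble(
  datetime = as.POSIXct(
    rep(c("2015-01-01 00:00:00", "2015-01-01 01:00:00"), each = 2),
    tz = "UTC"
  ),
  city_name = rep(c("Madrid", "Valencia"), times = 2),
  temp = c(270, 272, 271, 273),
  humidity = c(70, 80, 60, 80),
  wind_speed = c(1, 3, 2, 2),
  pressure = c(1001, 1003, 1002, 1004)
)
toy_energy <- tibble(
  datetime = as.POSIXct(c("2015-01-01 00:00:00", "2015-01-01 01:00:00"), tz = "UTC"),
  total_load_actual = c(25385, 24382)
)

weather_hourly <- aggregate_weather_hourly(toy_weather)
combined_hourly <- left_join(toy_energy, weather_hourly, by = "datetime")
print(combined_hourly)  # two rows, one per hour
```

With one row per hour, time-series plots and hourly averages no longer depend on how many weather records happen to exist for each timestamp.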

📊 3. Exploratory Plots

In this section, we visualize major variables over time to understand trends, patterns, and potential anomalies. All plots are based on the combined dataset, which merges weather and energy information.

📈 3.1 Energy Load: Forecast vs Actual

This plot shows the forecasted vs actual energy load over time. While the general trends match, we can spot some deviations, which may indicate model drift, unexpected events, or forecasting bias.

combined %>%
  select(datetime, total_load_forecast, total_load_actual) %>%
  pivot_longer(-datetime) %>%
  ggplot(aes(x = datetime, y = value, color = name)) +
  geom_line(alpha = 0.6) +
  labs(title = "Energy Load: Forecast vs Actual", y = "Load (MW)", x = "Time") +
  theme_minimal()
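The deviations between forecast and actual load that the plot shows can also be summarized numerically. The sketch below computes the mean error (bias), mean absolute error, and mean absolute percentage error; `forecast_error_summary()` is a hypothetical helper and the numbers are toy values, not results from the dataset.

```r
library(dplyr)

# Hypothetical helper: summarize forecast errors, skipping hours without an actual value
forecast_error_summary <- function(df) {
  df %>%
    filter(!is.na(total_load_actual)) %>%
    mutate(error = total_load_forecast - total_load_actual) %>%
    summarise(
      bias = mean(error),                                  # > 0 means over-forecasting
      mae = mean(abs(error)),                              # same unit as the load
      mape_pct = 100 * mean(abs(error) / total_load_actual)
    )
}

# Toy data for illustration only
toy_load <- tibble(
  total_load_forecast = c(100, 110, 90, 120),
  total_load_actual = c(100, 100, 100, NA)
)

errors <- forecast_error_summary(toy_load)
print(errors)
```

A bias close to zero combined with a non-trivial MAE would indicate errors that cancel out on average, which a line plot alone cannot reveal.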

🌤️ 3.2 Weather Trends Over Time

We track the evolution of temperature, humidity, and wind speed. The periodic structure (particularly in temperature) suggests strong seasonality. Humidity and wind speed show more variability, potentially correlating with energy production in certain sources.

combined %>%
  select(datetime, temp, wind_speed, humidity) %>%
  pivot_longer(-datetime) %>%
  ggplot(aes(x = datetime, y = value, color = name)) +
  geom_line(alpha = 0.5) +
  facet_wrap(~name, scales = "free_y") +
  labs(title = "Weather Variables Over Time", y = "Value", x = "Time") +
  theme_minimal()
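The temp values reported earlier (roughly 262 to 316) indicate that temperatures are stored in Kelvin. For readability in plots, a conversion to degrees Celsius can help; `kelvin_to_celsius()` below is a hypothetical helper, and the second input is the mean temp reported by skim().

```r
# Hypothetical helper: convert Kelvin to degrees Celsius
kelvin_to_celsius <- function(temp_k) {
  temp_k - 273.15
}

temps_c <- kelvin_to_celsius(c(273.15, 289.62))
print(round(temps_c, 2))  # 0.00 and 16.47
```

In the notebook this could be applied with `mutate(temp_c = kelvin_to_celsius(temp))` before plotting.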

📊 3.3 Price Trends

This chart compares the day-ahead market price with the actual price. Large discrepancies may highlight forecast errors or market volatility. These trends are essential for assessing model accuracy and energy strategy.

combined %>%
  select(datetime, price_day_ahead, price_actual) %>%
  pivot_longer(-datetime) %>%
  ggplot(aes(x = datetime, y = value, color = name)) +
  geom_line(alpha = 0.6) +
  labs(title = "Electricity Prices: Day Ahead vs Actual", y = "€/MWh", x = "Time") +
  theme_minimal()

🔎 3.4 Energy Load - Sample Week

To better understand short-term fluctuations, we zoom into the first week of 2015. We observe clear daily cycles, indicating high periodicity and potential for time-series decomposition.

combined %>%
  filter(datetime >= as.POSIXct("2015-01-01", tz = "UTC"),
         datetime < as.POSIXct("2015-01-08", tz = "UTC")) %>%  # half-open range: all seven days
  ggplot(aes(x = datetime, y = total_load_actual)) +
  geom_line() +
  labs(title = "Energy Load - Sample Week", y = "MW", x = "Date") +
  theme_minimal()

📅 4. Temporal Patterns

This section analyzes temporal patterns in energy demand by extracting time-based features and visualizing their influence on consumption.

🧩 4.1 Add Time Features

# Extract hour, weekday and month from datetime
combined <- combined %>%
  mutate(
    hour = hour(datetime),
    wday = wday(datetime, label = TRUE, abbr = TRUE),  # e.g., Mon, Tue
    month = month(datetime, label = TRUE, abbr = TRUE) # e.g., Jan, Feb
  )
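One caveat on the hour feature: the first timestamp printed by glimpse() is 2014-12-31 23:00:00, which suggests the times are stored in UTC (an assumption worth verifying against the raw file). In that case hour() returns UTC hours, which are one or two hours behind local time in mainland Spain. The sketch below shows how lubridate's with_tz() shifts the clock time without changing the instant.

```r
library(lubridate)

# Two example instants in UTC: one in winter, one in summer
utc_times <- as.POSIXct(
  c("2015-01-01 07:00:00", "2015-07-01 07:00:00"),
  tz = "UTC"
)

# Same instants expressed in Spanish local time (CET in winter, CEST in summer)
local_times <- with_tz(utc_times, tzone = "Europe/Madrid")

hour(utc_times)    # 7 7
hour(local_times)  # 8 9
```

If local behavior is what matters (for example, the morning demand ramp), deriving hour, wday, and month from `with_tz(datetime, "Europe/Madrid")` may give sharper patterns than the UTC versions.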

🕐 4.2 Average Load by Hour

# Average actual load by hour of the day
combined %>%
  group_by(hour) %>%
  summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = avg_load)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Average Energy Load by Hour",
    x = "Hour of Day", y = "Average Load (MW)"
  ) +
  theme_minimal()

💡 Insight: This plot reveals the daily cycle of energy demand, typically with peaks in the morning and evening, and a drop at night.

📅 4.3 Load Distribution by Day of Week

# Distribution of actual load by weekday
combined %>%
  mutate(wday = factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
  ggplot(aes(x = wday, y = total_load_actual)) +
  geom_boxplot(fill = "skyblue", alpha = 0.6, outlier.color = "gray") +
  labs(
    title = "Energy Load Distribution by Weekday",
    x = "Day of Week", y = "Load (MW)"
  ) +
  theme_minimal()

💡 Insight: Boxplots show variability between weekdays and weekends. Useful to detect behavioral or industrial consumption patterns.
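To compare weekdays and weekends directly, a boolean weekend flag can be derived without relying on day labels such as "Sat" or "Sun", which depend on the session locale. The sketch below uses the numeric form of wday() with Monday as the first day of the week; `is_weekend()` is a hypothetical helper.

```r
library(lubridate)

# Hypothetical helper: TRUE for Saturday (6) and Sunday (7) when weeks start on Monday
is_weekend <- function(x) {
  wday(x, week_start = 1) >= 6
}

# Friday 2 Jan 2015 through Monday 5 Jan 2015
dates <- as.POSIXct(
  c("2015-01-02 12:00:00", "2015-01-03 12:00:00",
    "2015-01-04 12:00:00", "2015-01-05 12:00:00"),
  tz = "UTC"
)

weekend_flags <- is_weekend(dates)
print(weekend_flags)  # FALSE TRUE TRUE FALSE
```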

📆 4.4 Load Distribution by Month

# Average load per month across all years
combined %>%
  group_by(month) %>%
  summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
  ggplot(aes(x = month, y = avg_load, group = 1)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(color = "darkgreen", size = 2) +
  labs(
    title = "Monthly Average Energy Load",
    x = "Month", y = "Average Load (MW)"
  ) +
  theme_minimal()

💡 Insight: This plot helps identify seasonal demand patterns. For example, summer or winter peaks could guide forecasting strategies.

📈 5. Correlation Analysis

In this section, we explore the relationships between numeric features, particularly between weather conditions and energy load.

🔍 5.1 Why do we do correlation analysis?

Correlation analysis helps us understand the relationships between numeric features in the dataset. This is useful for:

Selecting relevant predictors for machine learning models.
Detecting multicollinearity and removing redundant variables.
Uncovering physical dependencies or seasonal behavior in energy demand.

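In this notebook, we use Pearson correlation, which measures linear relationships. We apply drop_na() so that the correlations are calculated on complete rows only, and we drop constant columns, because cor() returns NA for any variable with zero standard deviation (several generation columns are all zeros).

# Select only numeric columns and compute Pearson correlation
correlation_matrix <- combined %>%
  select(where(is.numeric)) %>%
  drop_na() %>%
  select(where(~ sd(.x) > 0)) %>%  # drop constant columns (correlation undefined)
  cor(method = "pearson")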

🔍 5.2 Visualize Correlation Matrix

# Plot correlation matrix using ggcorrplot
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE,
           lab_size = 2.5,
           colors = c("#6D9EC1", "white", "#E46767"),
           title = "Correlation Matrix (Numerical Features)",
           ggtheme = theme_minimal())

💡 Note: This global correlation matrix provides a technical overview of all numeric features. While useful to detect potential redundancies or collinearities, its density makes it hard to extract actionable insights.

We defer focused interpretation to the next section.

🧠 5.3 Correlation on Energy Load and Weather Variables

To better understand how energy consumption relates to weather, we isolated key variables and computed their pairwise Pearson correlations.

# Focus only on energy load and weather variables
subset_corr <- combined %>%
  select(total_load_actual, total_load_forecast, temp, humidity, wind_speed, pressure) %>%
  drop_na() %>%
  cor()

ggcorrplot(subset_corr, lab = TRUE, type = "lower", ggtheme = theme_minimal())

💡 Key Insights

🔴 total_load_actual and total_load_forecast are very strongly correlated (≈ 1), which confirms the forecast’s high alignment with actual demand.
🔵 temp and humidity exhibit a moderate negative correlation (-0.574), reflecting the physical inverse relationship between air temperature and relative humidity.
🟡 The relationship between energy demand (total_load_actual) and weather conditions is weak:
- Temperature: 0.18 (weak positive)
- Humidity: -0.25 (weak negative)
- Wind speed: 0.13 (very weak positive)
⚪ Pressure is effectively uncorrelated with both energy demand and other variables (all correlations ≈ 0). The extreme pressure values seen in the skim() summary may distort this estimate.

These results suggest that while weather contributes to variations in energy usage, its linear association with load is weak. The underlying relationship may be non-linear or confounded by other factors such as time of day or city.
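A small toy example illustrates why a weak Pearson coefficient does not rule out a strong dependence. Below, y is fully determined by x, yet the linear correlation is zero because the relationship is U-shaped; correlating y with the distance of x from its center recovers most of the signal. The data is synthetic and does not represent the load or temperature series.

```r
# Synthetic, symmetric U-shaped relationship (illustration only)
x <- -5:5
y <- x^2

pearson_raw <- cor(x, y)            # linear correlation with x itself
pearson_distance <- cor(abs(x), y)  # correlation with distance from the center

print(round(pearson_raw, 3))
print(round(pearson_distance, 3))
```

For load and temperature, an analogous transformation would be the absolute deviation from some comfortable reference temperature, although whether demand actually follows such a shape in this dataset would need to be checked with a scatter plot.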

🧠 6. Insights & Next Steps

This exploratory analysis has revealed several key patterns in the energy demand dataset:

📌 Key Insights

Strong correlation between total_load_actual and total_load_forecast confirms the reliability of the forecasts provided.
Temperature shows moderate negative correlation with humidity, hinting at potential seasonal effects.
Energy consumption patterns follow a clear daily cycle, with peaks in the morning and evening, and lower demand at night.
Weekends show slightly lower energy demand, which may reflect reduced industrial activity.
Seasonal variations suggest that monthly factors should be considered when building predictive models.

🚀 Next Steps

Engineer lagged features to capture temporal dependencies.
Create rolling averages or moving windows for smoother time-series signals.
Introduce categorical encodings (e.g., one-hot for weekdays/months).
Prepare train-test split considering time-based validation (e.g., walk-forward).
Begin modeling with baseline approaches (e.g., linear regression, ARIMA, or gradient boosting).
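As a rough sketch of the first two steps, lagged and rolling features can be built with dplyr::lag() and a one-sided moving average from stats::filter(). The feature names, the lag choices (1 hour and 24 hours), and the 3-hour window are illustrative assumptions, and `add_lag_features()` is a hypothetical helper. It expects one row per hour, so the data should be deduplicated or aggregated first.

```r
library(dplyr)

# Hypothetical helper: add lagged and rolling-mean versions of the load.
# Assumes exactly one row per hour, sorted or sortable by datetime.
add_lag_features <- function(df) {
  df %>%
    arrange(datetime) %>%
    mutate(
      load_lag_1h = lag(total_load_actual, 1),
      load_lag_24h = lag(total_load_actual, 24),
      # Trailing 3-hour mean; stats:: is spelled out because dplyr masks filter()
      load_roll_mean_3h = as.numeric(
        stats::filter(total_load_actual, rep(1 / 3, 3), sides = 1)
      )
    )
}

# Toy data for illustration only: 48 hourly values 1, 2, ..., 48
toy_load <- tibble(
  datetime = seq(
    as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
    by = "hour", length.out = 48
  ),
  total_load_actual = as.numeric(1:48)
)

features <- add_lag_features(toy_load)
print(head(features, 4))
```

The first rows of each lagged or rolling column are NA by construction, so they need to be dropped or imputed before model fitting.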

This EDA serves as a foundation to inform feature selection and modeling decisions in the next phase of this project.

Finally, we export the combined dataset for future use.

# Create 'processed' folder if it doesn't exist
if (!dir.exists("../data/processed")) {
  dir.create("../data/processed", recursive = TRUE)
}
# Export combined cleaned dataset for use in Python
write_csv(combined, "../data/processed/combined_clean.csv")