This notebook performs an exploratory data analysis (EDA) on the hourly energy demand, generation, pricing, and weather dataset.

This dataset contains four years of hourly data on electricity consumption, generation, prices, and weather conditions in Spain. Energy data was retrieved from ENTSOE (a public TSO portal), pricing data from Red Eléctrica Española, and weather data (for the five largest cities in Spain) was originally obtained via the OpenWeather API and made public by the dataset author.

The main goals of this EDA are:

  • To understand the structure and quality of the datasets.
  • To detect missing or inconsistent values.
  • To identify temporal patterns in energy usage.
  • To prepare the data for feature engineering and modeling.

📦 Load libraries

# Load necessary packages for data wrangling and visualization
library(tidyverse)    # Core data manipulation tools
library(lubridate)    # Date-time parsing and extraction
library(skimr)        # Data summary and structure
library(janitor)      # Clean column names
library(ggplot2)      # Data visualization
library(ggthemes)     # Visualization themes
library(viridis)      # Color palettes
library(ggcorrplot)   # Correlation matrix visualization

📁 Load data

# Read raw CSV files (energy and weather) from 'data/raw' relative to notebook location
energy <- read_csv("../data/raw/energy_dataset.csv")
weather <- read_csv("../data/raw/weather_features.csv")

🧹 Clean column names

# Convert column names to snake_case format
energy <- energy %>% clean_names()
weather <- weather %>% clean_names()

🧹 1. Data Overview & Structure

To begin the analysis, we perform a quick structural inspection using glimpse() and a detailed statistical summary with skim() for both datasets: energy and weather.

🔍 1.1 Glimpse: See Columns and Data Types

🔍 1.1.1 Glimpse: Energy Dataset

# Display the structure of the energy dataset
glimpse(energy)
## Rows: 35,064
## Columns: 29
## $ time                                        <dttm> 2014-12-31 23:00:00, 2015…
## $ generation_biomass                          <dbl> 447, 449, 448, 438, 428, 4…
## $ generation_fossil_brown_coal_lignite        <dbl> 329, 328, 323, 254, 187, 1…
## $ generation_fossil_coal_derived_gas          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_gas                       <dbl> 4844, 5196, 4857, 4314, 41…
## $ generation_fossil_hard_coal                 <dbl> 4821, 4755, 4581, 4131, 38…
## $ generation_fossil_oil                       <dbl> 162, 158, 157, 160, 156, 1…
## $ generation_fossil_oil_shale                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_fossil_peat                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_geothermal                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_hydro_pumped_storage_aggregated  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ generation_hydro_pumped_storage_consumption <dbl> 863, 920, 1164, 1503, 1826…
## $ generation_hydro_run_of_river_and_poundage  <dbl> 1051, 1009, 973, 949, 953,…
## $ generation_hydro_water_reservoir            <dbl> 1899, 1658, 1371, 779, 720…
## $ generation_marine                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_nuclear                          <dbl> 7096, 7096, 7099, 7098, 70…
## $ generation_other                            <dbl> 43, 43, 43, 43, 43, 43, 43…
## $ generation_other_renewable                  <dbl> 73, 71, 73, 75, 74, 74, 74…
## $ generation_solar                            <dbl> 49, 50, 50, 50, 42, 34, 34…
## $ generation_waste                            <dbl> 196, 195, 196, 191, 189, 1…
## $ generation_wind_offshore                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ generation_wind_onshore                     <dbl> 6378, 5890, 5461, 5238, 49…
## $ forecast_solar_day_ahead                    <dbl> 17, 16, 8, 2, 9, 4, 3, 12,…
## $ forecast_wind_offshore_eday_ahead           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ forecast_wind_onshore_day_ahead             <dbl> 6436, 5856, 5454, 5151, 48…
## $ total_load_forecast                         <dbl> 26118, 24934, 23515, 22642…
## $ total_load_actual                           <dbl> 25385, 24382, 22734, 21286…
## $ price_day_ahead                             <dbl> 50.10, 48.10, 47.33, 42.27…
## $ price_actual                                <dbl> 65.41, 64.92, 64.48, 59.32…

The energy dataset contains 35,064 hourly records and 29 columns, which include:

  • Timestamps (time)
  • Energy generation by source (e.g., generation_biomass, generation_solar)
  • Forecast and actual values for energy consumption (total_load_forecast, total_load_actual)
  • Pricing variables (price_day_ahead, price_actual)

The data spans from 2014-12-31 23:00 to 2018-12-31 22:00 at 1-hour intervals: 35,064 rows is exactly 1,461 days × 24 hours, i.e. four full years including the 2016 leap day.
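The hourly spacing can be checked directly instead of being inferred from the first few rows. The sketch below looks at the gap between consecutive timestamps. It uses a small hypothetical tibble (`toy_energy`) with one hour deliberately removed in place of the real `energy` data, and `gaps` is an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for `energy`: 48 hourly UTC timestamps with one hour removed
toy_energy <- tibble(
  time = seq(as.POSIXct("2014-12-31 23:00:00", tz = "UTC"), by = "hour", length.out = 48)
) %>%
  slice(-10)  # drop the 10th row to simulate a missing hour

# Gap (in hours) between each timestamp and the previous one;
# keep only rows where the gap is not exactly one hour
gaps <- toy_energy %>%
  arrange(time) %>%
  mutate(gap_hours = as.numeric(difftime(time, lag(time), units = "hours"))) %>%
  filter(!is.na(gap_hours), gap_hours != 1)

gaps                    # rows that follow a gap (or a duplicated timestamp)
range(toy_energy$time)  # first and last timestamp
```

An empty `gaps` result on the real data would indicate a regular hourly series with no missing or repeated hours.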

🔍 1.1.2 Glimpse: Weather Dataset

# Display the structure of the weather dataset
glimpse(weather)
## Rows: 178,396
## Columns: 17
## $ dt_iso              <dttm> 2014-12-31 23:00:00, 2015-01-01 00:00:00, 2015-01…
## $ city_name           <chr> "Valencia", "Valencia", "Valencia", "Valencia", "V…
## $ temp                <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_min            <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ temp_max            <dbl> 270.4750, 270.4750, 269.6860, 269.6860, 269.6860, …
## $ pressure            <dbl> 1001, 1001, 1002, 1002, 1002, 1004, 1004, 1004, 10…
## $ humidity            <dbl> 77, 77, 78, 78, 78, 71, 71, 71, 71, 71, 71, 55, 55…
## $ wind_speed          <dbl> 1, 1, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ wind_deg            <dbl> 62, 62, 23, 23, 23, 321, 321, 321, 307, 307, 307, …
## $ rain_1h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rain_3h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow_3h             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds_all          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather_id          <dbl> 800, 800, 800, 800, 800, 800, 800, 800, 800, 800, …
## $ weather_main        <chr> "clear", "clear", "clear", "clear", "clear", "clea…
## $ weather_description <chr> "sky is clear", "sky is clear", "sky is clear", "s…
## $ weather_icon        <chr> "01n", "01n", "01n", "01n", "01n", "01n", "01n", "…

The weather dataset has 178,396 rows and 17 columns, including:

  • Timestamps (dt_iso)
  • Weather observations for major Spanish cities (city_name)
  • Variables such as temp, humidity, wind_speed, pressure, and weather_description

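This dataset covers the same datetime range as the energy dataset, which facilitates temporal merging. Note that it holds one row per city and hour, so it is several times longer than the energy dataset.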
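Before merging on the timestamp, it may be worth checking that each city has at most one row per hour. Five cities over the same 35,064 hours would give 175,320 rows, fewer than the 178,396 reported by glimpse(), so some city-hour pairs appear to be repeated. A minimal sketch of that check on a hypothetical toy tibble (`toy_weather`, `dup_keys` and `weather_dedup` are illustrative names):

```r
library(dplyr)

# Hypothetical stand-in for `weather`: two cities, three hours each
toy_weather <- tibble(
  dt_iso    = rep(seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                      by = "hour", length.out = 3), times = 2),
  city_name = rep(c("Valencia", "Madrid"), each = 3),
  temp      = c(270.5, 270.5, 269.7, 267.3, 267.3, 266.2)
)

# Append a copy of the first row to simulate a repeated observation
toy_weather <- bind_rows(toy_weather, toy_weather[1, ])

# City-hour combinations that occur more than once
dup_keys <- toy_weather %>%
  count(city_name, dt_iso, name = "n_rows") %>%
  filter(n_rows > 1)

# Keep the first row for each city-hour combination
weather_dedup <- toy_weather %>%
  distinct(city_name, dt_iso, .keep_all = TRUE)

dup_keys
nrow(weather_dedup)
```

Keeping the first row per key is only one possible policy. If the repeated rows differ in their weather values, averaging them may be preferable.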

📊 1.2 Skim: Summary Statistics

⚡ 1.2.1 Skim: Energy Dataset

# Display summary statistics for the energy dataset
skim(energy)
Data summary
Name energy
Number of rows 35064
Number of columns 29
_______________________
Column type frequency:
logical 2
numeric 26
POSIXct 1
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
generation_hydro_pumped_storage_aggregated 35064 0 NaN :
forecast_wind_offshore_eday_ahead 35064 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
generation_biomass 19 1 383.51 85.35 0.00 333.00 367.00 433.00 592.00 ▁▁▇▇▅
generation_fossil_brown_coal_lignite 18 1 448.06 354.57 0.00 0.00 509.00 757.00 999.00 ▇▂▅▅▆
generation_fossil_coal_derived_gas 18 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_fossil_gas 18 1 5622.74 2201.83 0.00 4126.00 4969.00 6429.00 20034.00 ▂▇▂▁▁
generation_fossil_hard_coal 18 1 4256.07 1961.60 0.00 2527.00 4474.00 5838.75 8359.00 ▃▆▆▇▃
generation_fossil_oil 19 1 298.32 52.52 0.00 263.00 300.00 330.00 449.00 ▁▁▃▇▂
generation_fossil_oil_shale 18 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_fossil_peat 18 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_geothermal 18 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_hydro_pumped_storage_consumption 19 1 475.58 792.41 0.00 0.00 68.00 616.00 4523.00 ▇▁▁▁▁
generation_hydro_run_of_river_and_poundage 19 1 972.12 400.78 0.00 637.00 906.00 1250.00 2000.00 ▁▇▆▃▂
generation_hydro_water_reservoir 18 1 2605.11 1835.20 0.00 1077.25 2164.00 3757.00 9728.00 ▇▆▃▁▁
generation_marine 19 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_nuclear 17 1 6263.91 839.67 0.00 5760.00 6566.00 7025.00 7117.00 ▁▁▁▂▇
generation_other 18 1 60.23 20.24 0.00 53.00 57.00 80.00 106.00 ▁▁▇▂▂
generation_other_renewable 18 1 85.64 14.08 0.00 73.00 88.00 97.00 119.00 ▁▁▃▇▅
generation_solar 18 1 1432.67 1680.12 0.00 71.00 616.00 2578.00 5792.00 ▇▂▁▁▁
generation_waste 19 1 269.45 50.20 0.00 240.00 279.00 310.00 357.00 ▁▁▂▇▇
generation_wind_offshore 18 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
generation_wind_onshore 18 1 5464.48 3213.69 0.00 2933.00 4849.00 7398.00 17436.00 ▇▇▅▂▁
forecast_solar_day_ahead 0 1 1439.07 1677.70 0.00 69.00 576.00 2636.00 5836.00 ▇▂▁▂▁
forecast_wind_onshore_day_ahead 0 1 5471.22 3176.31 237.00 2979.00 4855.00 7353.00 17430.00 ▇▇▃▂▁
total_load_forecast 0 1 28712.13 4594.10 18105.00 24793.75 28906.00 32263.25 41390.00 ▂▇▇▆▁
total_load_actual 36 1 28696.94 4574.99 18041.00 24807.75 28901.00 32192.00 41015.00 ▂▇▇▆▁
price_day_ahead 0 1 49.87 14.62 2.06 41.49 50.52 60.53 101.99 ▁▃▇▃▁
price_actual 0 1 57.88 14.20 9.33 49.35 58.02 68.01 116.80 ▁▅▇▂▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time 0 1 2014-12-31 23:00:00 2018-12-31 22:00:00 2016-12-31 10:30:00 35064
  • Most columns are numeric (dbl), with one datetime column (POSIXct) and two logical columns that contain only missing values (NA).
  • Missing data is very low (≤ 0.1%) for all variables except two, which are 100% missing and are dropped below:
    • generation_hydro_pumped_storage_aggregated
    • forecast_wind_offshore_eday_ahead
  • Some generation sources (coal-derived gas, oil shale, peat, geothermal, marine, and offshore wind) are recorded as zero or missing throughout the entire time series; this is plausible given Spain’s energy infrastructure.
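The all-zero sources can also be found programmatically rather than by reading the summary table. A minimal sketch, using a small hypothetical tibble (`toy_energy`) in place of the real data; `constant_cols` is an illustrative name:

```r
library(dplyr)

# Hypothetical stand-in for a few `energy` columns
toy_energy <- tibble(
  generation_biomass    = c(447, 449, 448, NA),
  generation_marine     = c(0, 0, 0, 0),
  generation_geothermal = c(0, 0, NA, 0)
)

# TRUE for columns with at most one distinct non-missing value
constant_cols <- toy_energy %>%
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE) <= 1)) %>%
  unlist()

# Names of the constant columns
names(constant_cols)[constant_cols]
```

Columns flagged this way carry no information for modeling and have an undefined correlation with every other variable, so they are candidates for removal.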

🌦️ 1.2.2 Skim: Weather Dataset

# Display summary statistics for the weather dataset
skim(weather)
Data summary
Name weather
Number of rows 178396
Number of columns 17
_______________________
Column type frequency:
character 4
numeric 12
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city_name 0 1 6 9 0 5 0
weather_main 0 1 3 12 0 12 0
weather_description 0 1 3 28 0 43 0
weather_icon 0 1 2 3 0 24 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
temp 0 1 289.62 8.03 262.24 283.67 289.15 295.15 315.60 ▁▅▇▅▁
temp_min 0 1 288.33 7.96 262.24 282.48 288.15 293.73 315.15 ▁▅▇▃▁
temp_max 0 1 291.09 8.61 262.24 284.65 290.15 297.15 321.15 ▁▅▇▃▁
pressure 0 1 1069.26 5969.63 0.00 1013.00 1018.00 1022.00 1008371.00 ▇▁▁▁▁
humidity 0 1 68.42 21.90 0.00 53.00 72.00 87.00 100.00 ▁▂▅▆▇
wind_speed 0 1 2.47 2.10 0.00 1.00 2.00 4.00 133.00 ▇▁▁▁▁
wind_deg 0 1 166.59 116.61 0.00 55.00 177.00 270.00 360.00 ▇▅▃▆▆
rain_1h 0 1 0.08 0.40 0.00 0.00 0.00 0.00 12.00 ▇▁▁▁▁
rain_3h 0 1 0.00 0.01 0.00 0.00 0.00 0.00 2.32 ▇▁▁▁▁
snow_3h 0 1 0.00 0.22 0.00 0.00 0.00 0.00 21.50 ▇▁▁▁▁
clouds_all 0 1 25.07 30.77 0.00 0.00 20.00 40.00 100.00 ▇▁▁▂▁
weather_id 0 1 759.83 108.73 200.00 800.00 800.00 801.00 804.00 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
dt_iso 0 1 2014-12-31 23:00:00 2018-12-31 22:00:00 2017-01-05 05:00:00 35064
  • Weather data is well-structured with no missing values in any column.
  • It includes:
    • Numeric weather conditions (temp, pressure, wind_speed, etc.)
    • Categorical weather types (weather_main, weather_description)
    • Identifiers and icons from the API (weather_id, weather_icon)

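This dataset is well-suited for modeling, with hourly granularity for each city. Two caveats are visible in the summary: pressure (min 0, max 1,008,371) and wind_speed (max 133) contain implausible extreme values, and the 178,396 rows exceed the 5 × 35,064 = 175,320 expected from five cities and 35,064 unique timestamps, so some city-hour combinations are duplicated.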
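The skim() output reports a pressure range of 0 to 1,008,371 and a maximum wind_speed of 133, which look implausible if the units are hPa and m/s (an assumption, not confirmed by the source). One way to handle such values is to set anything outside a plausible range to NA. The bounds and names below (`pressure_bounds`, `wind_speed_max`, `weather_flagged`) are hypothetical choices for illustration:

```r
library(dplyr)

# Hypothetical stand-in for `weather`, including the extremes seen in skim()
toy_weather <- tibble(
  pressure   = c(1013, 1018, 0, 1008371, 1022),
  wind_speed = c(1, 2, 133, 4, 0)
)

# Assumed plausibility bounds (not taken from the dataset documentation)
pressure_bounds <- c(900, 1100)  # hPa
wind_speed_max  <- 50            # m/s

# Replace out-of-range readings with NA so they can be imputed or dropped later
weather_flagged <- toy_weather %>%
  mutate(
    pressure   = if_else(between(pressure, pressure_bounds[1], pressure_bounds[2]),
                         pressure, NA_real_),
    wind_speed = if_else(wind_speed <= wind_speed_max, wind_speed, NA_real_)
  )

colSums(is.na(weather_flagged))  # number of values flagged per column
```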

📌 1.3 Conclusion of Initial Inspection

  • Both datasets are well-structured, with very few missing values.
  • Only two columns need to be removed due to full missingness.
  • The time coverage and granularity are compatible, making them ideal for feature engineering and merging.
  • From this point forward, we can derive temporal features, visualize trends, and assess correlations.

Finally, we treat our datasets based on our observations:

  • We double-check for missing values and
# Summarize missing values per variable in both datasets to assess data quality
# NA summary - Energy
energy %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(
    pct_missing = round(100 * n_missing / nrow(energy), 2)
  ) %>%
  arrange(desc(pct_missing))
## # A tibble: 29 × 3
##    variable                                   n_missing pct_missing
##    <chr>                                          <int>       <dbl>
##  1 generation_hydro_pumped_storage_aggregated     35064      100   
##  2 forecast_wind_offshore_eday_ahead              35064      100   
##  3 total_load_actual                                 36        0.1 
##  4 generation_biomass                                19        0.05
##  5 generation_fossil_brown_coal_lignite              18        0.05
##  6 generation_fossil_coal_derived_gas                18        0.05
##  7 generation_fossil_gas                             18        0.05
##  8 generation_fossil_hard_coal                       18        0.05
##  9 generation_fossil_oil                             19        0.05
## 10 generation_fossil_oil_shale                       18        0.05
## # ℹ 19 more rows
# NA summary - Weather
weather %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(
    pct_missing = round(100 * n_missing / nrow(weather), 2)
  ) %>%
  arrange(desc(pct_missing))
## # A tibble: 17 × 3
##    variable            n_missing pct_missing
##    <chr>                   <int>       <dbl>
##  1 dt_iso                      0           0
##  2 city_name                   0           0
##  3 temp                        0           0
##  4 temp_min                    0           0
##  5 temp_max                    0           0
##  6 pressure                    0           0
##  7 humidity                    0           0
##  8 wind_speed                  0           0
##  9 wind_deg                    0           0
## 10 rain_1h                     0           0
## 11 rain_3h                     0           0
## 12 snow_3h                     0           0
## 13 clouds_all                  0           0
## 14 weather_id                  0           0
## 15 weather_main                0           0
## 16 weather_description         0           0
## 17 weather_icon                0           0
  • remove columns with 100% missingness and
# Remove columns that contain 100% missing values and provide no useful information
energy <- energy %>%
  select(-any_of(c("generation_hydro_pumped_storage_aggregated",
                   "forecast_wind_offshore_eday_ahead")))
  • rename time columns to datetime for consistency.
# Rename time columns to 'datetime' for consistency
energy <- energy %>%
  rename(datetime = time)
weather <- weather %>%
  rename(datetime = dt_iso)

and that’s it! We are ready to merge the datasets.

🔍 2. Merging Datasets

Once we have cleaned the datasets, we can merge them based on the datetime column. This will allow us to analyze the relationship between energy consumption and weather conditions.

combined <- left_join(energy, weather, by = "datetime")

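We use left_join() to ensure that all records from the energy dataset are retained, even if there are no corresponding weather records. Because weather has one row per city and hour, this is a one-to-many join: combined contains roughly five rows per hour, with the energy values repeated for each city.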

Note: Weather data includes observations from five major Spanish cities. Since the energy dataset is aggregated at the national level, no spatial distinction is made in this EDA. All weather records are merged and interpreted as representative of national conditions. In a future step, we could aggregate weather by hour (averaging across cities) or isolate individual cities if needed.
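The hourly aggregation mentioned in the note could look like the sketch below, which averages the weather variables across cities before joining so that the result keeps one row per hour. The toy tibbles and the names `weather_hourly` and `combined_hourly` are hypothetical, and the unweighted mean across cities is an assumption (a population-weighted mean might be more representative).

```r
library(dplyr)

hours <- seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"), by = "hour", length.out = 3)

# Hypothetical stand-ins for the cleaned `energy` and `weather` tibbles
toy_energy <- tibble(
  datetime          = hours,
  total_load_actual = c(25385, 24382, 22734)
)
toy_weather <- tibble(
  datetime  = rep(hours, times = 2),
  city_name = rep(c("Valencia", "Madrid"), each = 3),
  temp      = c(270, 271, 272, 266, 267, 268),
  humidity  = c(77, 78, 71, 63, 64, 65)
)

# One row per hour: unweighted mean of each weather variable across cities
weather_hourly <- toy_weather %>%
  group_by(datetime) %>%
  summarise(across(c(temp, humidity), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

# One-to-one join: the result has the same number of rows as the energy data
combined_hourly <- left_join(toy_energy, weather_hourly, by = "datetime")
combined_hourly
```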

📊 3. Exploratory Plots

In this section, we visualize major variables over time to understand trends, patterns, and potential anomalies. All plots are based on the combined dataset, which merges weather and energy information.

📈 3.1 Energy Load: Forecast vs Actual

This plot shows the forecasted vs actual energy load over time. While the general trends match, we can spot some deviations, which may indicate model drift, unexpected events, or forecasting bias.

combined %>%
  select(datetime, total_load_forecast, total_load_actual) %>%
  pivot_longer(-datetime) %>%
  ggplot(aes(x = datetime, y = value, color = name)) +
  geom_line(alpha = 0.6) +
  labs(title = "Energy Load: Forecast vs Actual", y = "Load (MW)", x = "Time") +
  theme_minimal()
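The deviations visible in the plot can also be quantified. The sketch below computes the mean error (bias), mean absolute error, and mean absolute percentage error of the forecast. It uses the first four load values shown in the glimpse() output as a toy sample, so the numbers say nothing about the full series; `toy_load` and `forecast_errors` are illustrative names.

```r
library(dplyr)

# First four hourly values from the glimpse() output, used as a toy sample
toy_load <- tibble(
  total_load_forecast = c(26118, 24934, 23515, 22642),
  total_load_actual   = c(25385, 24382, 22734, 21286)
)

forecast_errors <- toy_load %>%
  mutate(error = total_load_forecast - total_load_actual) %>%
  summarise(
    bias = mean(error, na.rm = TRUE),                               # > 0 means over-forecasting
    mae  = mean(abs(error), na.rm = TRUE),                          # mean absolute error (MW)
    mape = 100 * mean(abs(error) / total_load_actual, na.rm = TRUE) # mean absolute % error
  )

forecast_errors
```

A bias close to zero with a non-trivial MAE would point to unsystematic errors, while a bias of similar size to the MAE would point to a consistent over- or under-forecast.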

🔎 3.2 Energy Load - Sample Week

To better understand short-term fluctuations, we zoom into the first week of 2015. We observe clear daily cycles, indicating high periodicity and potential for time-series decomposition.

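# Keep the seven days 2015-01-01 through 2015-01-07 (assumes datetime is stored in UTC)
combined %>%
  filter(datetime >= as.POSIXct("2015-01-01", tz = "UTC"),
         datetime <  as.POSIXct("2015-01-08", tz = "UTC")) %>%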
  ggplot(aes(x = datetime, y = total_load_actual)) +
  geom_line() +
  labs(title = "Energy Load - Sample Week", y = "MW", x = "Date") +
  theme_minimal()
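As a rough illustration of the decomposition idea mentioned above, the sketch below applies base R's stl() to a synthetic hourly series with a 24-hour cycle. The series (`toy_load`) is simulated, not taken from the dataset, and the periodic seasonal window is an assumed setting.

```r
set.seed(42)

# Synthetic two-week hourly load: daily sine cycle plus noise (illustrative only)
n_days      <- 14
hour_of_day <- rep(0:23, times = n_days)
toy_load    <- 28000 + 4000 * sin(2 * pi * (hour_of_day - 6) / 24) +
  rnorm(24 * n_days, sd = 300)

# Treat the series as having 24 observations per cycle (one day)
load_ts <- ts(toy_load, frequency = 24)

# Seasonal-trend decomposition using loess with a fixed (periodic) daily pattern
decomp <- stl(load_ts, s.window = "periodic")

head(decomp$time.series)  # columns: seasonal, trend, remainder
```

On the real data, stl() stops on missing values by default, so the missing total_load_actual values would need to be filled or dropped first, and the series should contain one row per hour.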

📅 4. Temporal Patterns

This section analyzes temporal patterns in energy demand by extracting time-based features and visualizing their influence on consumption.

🧩 4.1 Add Time Features

# Extract hour, weekday and month from datetime
combined <- combined %>%
  mutate(
    hour = hour(datetime),
    wday = wday(datetime, label = TRUE, abbr = TRUE),  # e.g., Mon, Tue
    month = month(datetime, label = TRUE, abbr = TRUE) # e.g., Jan, Feb
  )

🕐 4.2 Average Load by Hour

# Average actual load by hour of the day
combined %>%
  group_by(hour) %>%
  summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = avg_load)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Average Energy Load by Hour",
    x = "Hour of Day", y = "Average Load (MW)"
  ) +
  theme_minimal()

💡 Insight: This plot reveals the daily cycle of energy demand, typically with peaks in the morning and evening, and a drop at night.
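One caveat when reading the hourly profile: the timestamps appear to be stored in UTC (the series starts at 23:00 on 2014-12-31), so hour(datetime) would return UTC hours rather than Spanish local time. A minimal sketch of converting to local time with lubridate's with_tz(), assuming "Europe/Madrid" is the relevant time zone; `toy` and `hour_local` are illustrative names:

```r
library(dplyr)
library(lubridate)

# Two hypothetical UTC timestamps, one in winter and one in summer
toy <- tibble(
  datetime = as.POSIXct(c("2015-01-15 07:00:00", "2015-07-15 07:00:00"), tz = "UTC")
) %>%
  mutate(
    hour_utc   = hour(datetime),
    # Same instant, displayed in Spanish local time (UTC+1 in winter, UTC+2 in summer)
    hour_local = hour(with_tz(datetime, tzone = "Europe/Madrid"))
  )

toy
```

With local hours, the morning and evening peaks line up with clock time in Spain throughout the year instead of shifting by an hour between winter and summer.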

📅 4.3 Load Distribution by Day of Week

# Distribution of actual load by weekday
combined %>%
  mutate(wday = factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
  ggplot(aes(x = wday, y = total_load_actual)) +
  geom_boxplot(fill = "skyblue", alpha = 0.6, outlier.color = "gray") +
  labs(
    title = "Energy Load Distribution by Weekday",
    x = "Day of Week", y = "Load (MW)"
  ) +
  theme_minimal()

💡 Insight: The boxplots compare the load distribution across weekdays and weekends, which helps detect behavioral or industrial consumption patterns.

📆 4.4 Load Distribution by Month

# Average load per month across all years
combined %>%
  group_by(month) %>%
  summarise(avg_load = mean(total_load_actual, na.rm = TRUE)) %>%
  ggplot(aes(x = month, y = avg_load, group = 1)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(color = "darkgreen", size = 2) +
  labs(
    title = "Monthly Average Energy Load",
    x = "Month", y = "Average Load (MW)"
  ) +
  theme_minimal()

💡 Insight: This plot helps identify seasonal demand patterns. For example, summer or winter peaks could guide forecasting strategies.

📈 5. Correlation Analysis

In this section, we explore the relationships between numeric features, particularly between weather conditions and energy load.

🔍 5.1 Why do we do correlation analysis?

Correlation analysis helps us understand the relationships between numeric features in the dataset. This is useful for:

  • Selecting relevant predictors for machine learning models.
  • Detecting multicollinearity and removing redundant variables.
  • Uncovering physical dependencies or seasonal behavior in energy demand.

In this notebook, we use Pearson correlation, which measures linear relationships. We also apply drop_na() to ensure the correlations are calculated on complete data only.

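# Select only numeric columns and compute Pearson correlation
correlation_matrix <- combined %>%
  select(where(is.numeric)) %>%
  drop_na() %>%
  # Drop constant columns (e.g., all-zero generation sources): their correlation is undefined (NA)
  select(where(~ sd(.x) > 0)) %>%
  cor(method = "pearson")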

🔍 5.2 Visualize Correlation Matrix

# Plot correlation matrix using ggcorrplot
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE,
           lab_size = 2.5,
           colors = c("#6D9EC1", "white", "#E46767"),
           title = "Correlation Matrix (Numerical Features)",
           ggtheme = theme_minimal())

💡 Note: This global correlation matrix provides a technical overview of all numeric features. While useful to detect potential redundancies or collinearities, its density makes it hard to extract actionable insights.

We defer focused interpretation to the next section.

🧠 5.3 Correlation on Energy Load and Weather Variables

To better understand how energy consumption relates to weather, we isolated key variables and computed their pairwise Pearson correlations.

# Focus only on energy load and weather variables
subset_corr <- combined %>%
  select(total_load_actual, total_load_forecast, temp, humidity, wind_speed, pressure) %>%
  drop_na() %>%
  cor()

ggcorrplot(subset_corr, lab = TRUE, type = "lower", ggtheme = theme_minimal())

💡 Key Insights

  • 🔴 total_load_actual and total_load_forecast are very strongly correlated (≈ 1), which confirms the forecast’s high alignment with actual demand.
  • 🔵 temp and humidity exhibit a moderate negative correlation (-0.574), consistent with the inverse relationship between air temperature and relative humidity.
  • 🟡 The linear relationship between energy demand (total_load_actual) and weather conditions is weak:
    • Temperature: 0.18 (weak positive)
    • Humidity: -0.25 (weak negative)
    • Wind speed: 0.13 (very weak positive)
  • Pressure is effectively uncorrelated with both energy demand and other variables (all correlations ≈ 0).

These results suggest that while weather contributes to variations in energy usage, its linear association with load is limited. The relationship may be non-linear or confounded by other factors such as time of day or city.
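To illustrate why a low Pearson coefficient does not rule out a strong relationship, the sketch below simulates a U-shaped dependence of load on temperature (higher demand at both cold and hot extremes). The data are entirely synthetic and the functional form is an assumption made for illustration, not a finding from this dataset.

```r
set.seed(1)

# Synthetic example: load is lowest near 18 °C and rises toward both extremes
temp_c <- seq(0, 36, length.out = 200)
load   <- 25000 + 15 * (temp_c - 18)^2 + rnorm(200, sd = 200)

# Pearson only captures the linear component, which is close to zero here
pearson <- cor(temp_c, load, method = "pearson")

# A quadratic fit explains most of the variance in the same data
r2_quadratic <- summary(lm(load ~ poly(temp_c, 2)))$r.squared

c(pearson = pearson, r2_quadratic = r2_quadratic)
```

If the real data behave similarly, a scatter plot of load against temperature, or a model with non-linear terms, would be more informative than the correlation coefficient alone.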

🧠 6. Insights & Next Steps

This exploratory analysis has revealed several key patterns in the energy demand dataset:

📌 Key Insights

  • Strong correlation between total_load_actual and total_load_forecast confirms the reliability of the forecasts provided.
  • Temperature shows a moderate negative correlation with humidity (-0.574), hinting at potential seasonal effects.
  • Energy consumption patterns follow a clear daily cycle, with peaks in the morning and evening, and lower demand at night.
  • Weekends show slightly lower energy demand, which may reflect reduced industrial activity.
  • Seasonal variations suggest that monthly factors should be considered when building predictive models.

🚀 Next Steps

  • Engineer lagged features to capture temporal dependencies.
  • Create rolling averages or moving windows for smoother time-series signals.
  • Introduce categorical encodings (e.g., one-hot for weekdays/months).
  • Prepare train-test split considering time-based validation (e.g., walk-forward).
  • Begin modeling with baseline approaches (e.g., linear regression, ARIMA, or gradient boosting).
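The first steps in this list could be sketched as follows: lagged features, a trailing rolling mean, and a time-based train/test split. The toy series, the lag choices (1 and 24 hours), the 24-hour window, and the cutoff date are all illustrative assumptions rather than decisions made in this notebook.

```r
library(dplyr)

# Hypothetical hourly series: four days with a clean daily cycle
toy <- tibble(
  datetime          = seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                          by = "hour", length.out = 96),
  total_load_actual = 25000 + 3000 * sin(2 * pi * (0:95) / 24)
)

features <- toy %>%
  arrange(datetime) %>%
  mutate(
    load_lag_1  = lag(total_load_actual, 1),   # value one hour earlier
    load_lag_24 = lag(total_load_actual, 24),  # value at the same hour one day earlier
    # Mean of the previous 24 hours; built on the lagged series so the
    # current hour's value does not leak into its own feature
    load_roll_mean_24 = as.numeric(
      stats::filter(load_lag_1, rep(1 / 24, 24), sides = 1)
    )
  )

# Time-based split: train on everything before the cutoff, test on the rest
cutoff <- as.POSIXct("2015-01-04 00:00:00", tz = "UTC")
train  <- features %>% filter(datetime < cutoff)
test   <- features %>% filter(datetime >= cutoff)

c(train = nrow(train), test = nrow(test))
```

A walk-forward scheme would repeat this split with a moving cutoff, refitting on the expanding (or sliding) training window each time.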

This EDA serves as a foundation to inform feature selection and modeling decisions in the next phase of this project.

Finally, we export the combined dataset for future use.

# Create 'processed' folder if it doesn't exist
if (!dir.exists("../data/processed")) {
  dir.create("../data/processed", recursive = TRUE)
}
# Export combined cleaned dataset for use in Python
write_csv(combined, "../data/processed/combined_clean.csv")