Scope

At the beginning of the semester, I asked my friend, who works in the data science department at PSE (Polskie Sieci Elektroenergetyczne), about potential applications of clustering in power engineering. He explained that, to simulate the impact of various investments and grid development strategies, they use clustering techniques to group years based on historical climate data. These clusters serve as the basis for generating representative scenarios, which are then used in simulation models to analyze different development strategies.

i tried to do something similar.

PECD (Pan-European Climate Database) is a dataset with historical and simulated climate data for Europe. It’s widely used in energy system modeling, especially in power system analysis, to assess how weather conditions impact electricity demand, renewable energy generation (like wind and solar), and overall grid stability.

As they probably have more data from the power grid and more powerful computers, my analysis is simpler, but showed me a good understanding of dimension reduction and clustering evaluations.

Used Libraries

library(lubridate)
library(tidyverse)
library(corrplot)
library(clusterSim)
library(factoextra)
library(cluster)
library(scatterplot3d)
library(plotly)
library(ggplot2)
library(kableExtra)

Data processing

I downloaded data from Climate and energy indicators for Europe from 1979 to present derived from reanalysis provided by Copernicus UE - documentation. I pre-processed the data to get variables for Poland and rows for the years. Data are annual aggregates.

data <- read.csv("merged_data.csv")

data$Hydropower_Reservoir <- NULL
data$Hydropower_Run_Of_River <- NULL

data$Date <- as.Date(data$Date, format = "%Y-%m-%d")

data <- data %>%
  mutate(Date = year(Date))

Below I present graphs of various climatic variations between years. The average temperature is indeed rising. I was surprised by the demand for electricity, as far as I know it should also be rising. Nevertheless, the data is as it is and my goal is to run PCA and clustering on it.

data_long <- pivot_longer(data, cols = -Date, names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Date, y = Value, color = Variable)) +
  geom_line(size = 1) +  
  facet_wrap(~ Variable, scales = "free_y") + 
  theme_minimal() +
  labs(x = "year", y = "value") +
  theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Data has no NA values and has 9 columns. Next I will try to reduce the variables to a few most meaningful ones.

rownames(data) <- data$Date
data$Date <- NULL
str(data)
head(data)
dim(data)
sum(is.na(data))
## 'data.frame':    45 obs. of  9 variables:
##  $ Wind_Power_Onshore          : num  9748479 9811121 10019283 8426648 10595300 ...
##  $ Wind_Speed_100m             : num  6.28 6.29 6.38 5.86 6.58 ...
##  $ Wind_Speed_10m              : num  3.72 3.76 3.8 3.45 3.92 ...
##  $ Total_Precipitation         : num  0.736 0.885 0.891 0.583 0.719 ...
##  $ Solar_PV_Power              : num  1258776 1116505 1176868 1314124 1214406 ...
##  $ Mean_Sea_Level_Pressure     : num  101514 101506 101415 101723 101539 ...
##  $ Global_Horizontal_Irradiance: num  117 108 111 121 115 ...
##  $ Electricity_Demand          : num  1.50e+08 1.51e+08 1.50e+08 1.50e+08 1.49e+08 ...
##  $ Air_Temperature             : num  280 280 281 282 282 ...
##      Wind_Power_Onshore Wind_Speed_100m Wind_Speed_10m Total_Precipitation
## 1979            9748479        6.276812       3.722263           0.7362319
## 1980            9811121        6.289060       3.760425           0.8847634
## 1981           10019283        6.375627       3.801285           0.8911276
## 1982            8426648        5.855022       3.450780           0.5828524
## 1983           10595300        6.581627       3.919874           0.7189458
## 1984            9410046        6.183150       3.682740           0.7257658
##      Solar_PV_Power Mean_Sea_Level_Pressure Global_Horizontal_Irradiance
## 1979        1258776                101514.1                     117.4064
## 1980        1116505                101506.0                     107.8236
## 1981        1176868                101414.6                     110.9951
## 1982        1314124                101723.1                     121.2929
## 1983        1214406                101538.8                     114.9516
## 1984        1176614                101594.8                     113.3527
##      Electricity_Demand Air_Temperature
## 1979          150351269        280.3978
## 1980          150871484        279.6771
## 1981          149941121        280.9439
## 1982          149607568        281.5182
## 1983          149264931        282.1330
## 1984          149703257        280.9075
## [1] 45  9
## [1] 0

At the end of EDA lets check the distributions of variables.

par(mfrow = c(3, 3))

for(col in names(data)) {
  hist(data[[col]], main = col, col = "lightblue", breaks = 30)
}

Dimensions Reduction

Normalizing data

As data has different units, i should be normalized. PCA works on scaled data.

data.n <- scale(data)
dim(data.n)

Is data clusterable?

Before PCA, I want to check if the data is clusterable at all. As you can see below, the Hopkins statistic is really bad and suggests uninformative clusters.

get_clust_tendency(data, 10, graph=T, gradient=list(low="red", mid="white", high="blue"), seed = 123) 
## $hopkins_stat
## [1] 0.4930455
## 
## $plot

Correlation

As PCA reduces redundancy of data, it should work the best on correlated data.

Common sense suggests that climatic variables such as temperature, pressure, and wind speed are highly correlated. However, to ensure an objective approach, I assume that I have no prior knowledge of these physical phenomena.

Below I check the correlations using corrplot() and I can see that the variables are highly correlated. Wind speed at 10m corresponds to wind speed at 100m and wind power production. Air temperature is negatively correlated with electricity demand, suggesting that in warmer years less electricity can be used for heating. Precipitation is negatively correlated with pressure.

corrplot(cor(data))

Principal Components Analysis

PCA is a method which reduces the data into most meaningful variables. Method provides set of uncorrelated principles components that describes given set as best as possible. The first principal component captures the most variance in the data set, the second captures the next most, and so on.

I performed analysis on my scaled numeric data.

Spectral decomposition approach

pca<-princomp(data.n)
fviz_pca_var(pca, col.var="steelblue", repel=TRUE, ylim=c(-1,1), xlim=c(-1,1))

fviz_eig(pca)   

summary(pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.0422755 1.5805592 1.1271762 0.69511060 0.48261301
## Proportion of Variance 0.4739647 0.2838827 0.1443780 0.05490668 0.02646765
## Cumulative Proportion  0.4739647 0.7578473 0.9022253 0.95713201 0.98359966
##                             Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.290288594 0.225780992 0.0865864714 0.0397653930
## Proportion of Variance 0.009575849 0.005792847 0.0008519565 0.0001796916
## Cumulative Proportion  0.993175505 0.998968352 0.9998203084 1.0000000000

Singular value decomposition (SVD):

pca2<-prcomp(data.n)
fviz_pca_var(pca2, col.var="steelblue", repel=TRUE, ylim=c(-1,1), xlim=c(-1,1))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

fviz_eig(pca2)  

summary(pca2)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.065 1.5984 1.1399 0.70297 0.48807 0.29357 0.22833
## Proportion of Variance 0.474 0.2839 0.1444 0.05491 0.02647 0.00958 0.00579
## Cumulative Proportion  0.474 0.7579 0.9022 0.95713 0.98360 0.99318 0.99897
##                            PC8     PC9
## Standard deviation     0.08756 0.04021
## Proportion of Variance 0.00085 0.00018
## Cumulative Proportion  0.99982 1.00000

The output is large (expected). 90% of my data can be described by 3 principal components. The scree plot shows how much of the variance is explained by each component. The arrow plot visualises the dependent variables.

Now I can see the contribution of each variable between the components (graphs below). The first component depends mostly on the variable ‘the sun’, the second on demand and the third on pressure. I ran the analysis on both prcomp() and princomp() functions. As the results are not different I chose princcomp() for further analysis.

var<-get_pca_var(pca)
fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)

fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)

fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)

Clustering

Raw data

Let’s check the best number of clusters on the raw data using both gap statistics and silhouette.

Based on graphs and calculated Hopkin’s data are poorly clusterable. The best number of clusters is 1 or 7 (based on silhouette).

?fviz_nbclust

fviz_nbclust(data, FUNcluster = pam, method="gap_stat")

fviz_nbclust(data, FUNcluster = pam, method="silhouette")

I performed clustering on raw data and the average silhouette is moderate. I am choosing PAM in order to get representative years for each cluster.

pam_res<-eclust(data, "pam", hc_metric="euclidian",k=7)

fviz_silhouette(pam_res)

##   cluster size ave.sil.width
## 1       1    7          0.29
## 2       2    9          0.35
## 3       3   10          0.52
## 4       4    8          0.49
## 5       5    6          0.44
## 6       6    1          0.00
## 7       7    4          0.26

The years which best describes each clusters are given and the mean values of cluster are calculated.

cluster Wind_Power_Onshore Wind_Speed_100m Wind_Speed_10m Total_Precipitation Solar_PV_Power Mean_Sea_Level_Pressure Global_Horizontal_Irradiance Electricity_Demand Air_Temperature
1985 9228255 6.10 3.62 0.81 1227777 101541.7 115.77 150758605 280.26
1993 10191066 6.42 3.82 0.78 1234803 101549.3 116.71 149660242 281.52
2013 8551936 5.90 3.49 0.72 1311163 101617.1 123.12 149663850 281.88
2004 9419532 6.17 3.66 0.73 1266065 101600.1 118.78 149596511 281.57
2015 9778886 6.29 3.73 0.71 1292318 101583.7 122.55 148770207 282.89
1990 10415838 6.50 3.86 0.72 1244594 101569.6 116.99 147951251 282.27
2023 8963967 6.03 3.57 0.77 1306702 101548.8 126.06 148786166 282.91

Reduced data

I also conducted the same clustering analysis on reduced data.

pca_df <- as.data.frame(pca$scores[, 1:3])
colnames(pca_df) <- c("PCA1", "PCA2", "PCA3")
dim(pca_df)
## [1] 45  3

Reducing data gives possibility to plot the on 3d graph.

scatterplot3d(pca_df$PCA1, pca_df$PCA2, pca_df$PCA3, 
              color = "blue", pch = 16, main = "PCA 3D")

Now preanalysis suggests 2 clusters.

fviz_nbclust(pca_df, FUNcluster=pam, method="gap_stat")

fviz_nbclust(pca_df, FUNcluster=pam, method="silhouette")

Results are worse than on the raw data. PCA helped to visualize data but has worsen clustering evaluation.

pam_pca_res<-eclust(pca_df, "pam", hc_metric="euclidian",k=2)

fviz_silhouette(pam_pca_res)

3d visualization of clusters:

cluster Wind_Power_Onshore Wind_Speed_100m Wind_Speed_10m Total_Precipitation Solar_PV_Power Mean_Sea_Level_Pressure Global_Horizontal_Irradiance Electricity_Demand Air_Temperature
1 9730571 6.27 3.72 0.78 1244930 101545.1 117.77 149568212 281.62
2 8681392 5.94 3.51 0.70 1321682 101641.2 124.38 149622252 281.97
## [1] "Medoids:"
##           PCA1       PCA2        PCA3
## 1995  1.064503  0.1124647 -0.02568238
## 2005 -2.818484 -0.3508842  0.33036523

Conclusions

PCA allowed a reduction of variables and a better interpretation of the data, but did not improve the clustering results. The raw data were more ‘clusterable’.

It is possible to find representative years and group them by climate data, but I feel that what I have done is somehow unnecessary. It makes more sense on a larger dataset. One could add the same factors like general happiness, energy price, etc. or more technical data on grid performance.

I checked if the medoids are historically important, like if the “zima stulecia” (engl. winter of the century) or some other big catastrophe was in those years, but did not come to any conclusions.