At the beginning of the semester, I asked my friend, who works in the data science department at PSE (Polskie Sieci Elektroenergetyczne), about potential applications of clustering in power engineering. He explained that, to simulate the impact of various investments and grid development strategies, they use clustering techniques to group years based on historical climate data. These clusters serve as the basis for generating representative scenarios, which are then used in simulation models to analyze different development strategies.
i tried to do something similar.
PECD (Pan-European Climate Database) is a dataset with historical and simulated climate data for Europe. It’s widely used in energy system modeling, especially in power system analysis, to assess how weather conditions impact electricity demand, renewable energy generation (like wind and solar), and overall grid stability.
As they probably have more data from the power grid and more powerful computers, my analysis is simpler, but showed me a good understanding of dimension reduction and clustering evaluations.
library(lubridate)
library(tidyverse)
library(corrplot)
library(clusterSim)
library(factoextra)
library(cluster)
library(scatterplot3d)
library(plotly)
library(ggplot2)
library(kableExtra)
I downloaded data from Climate and energy indicators for Europe from 1979 to present derived from reanalysis provided by Copernicus UE - documentation. I pre-processed the data to get variables for Poland and rows for the years. Data are annual aggregates.
data <- read.csv("merged_data.csv")
data$Hydropower_Reservoir <- NULL
data$Hydropower_Run_Of_River <- NULL
data$Date <- as.Date(data$Date, format = "%Y-%m-%d")
data <- data %>%
mutate(Date = year(Date))
Below I present graphs of various climatic variations between years. The average temperature is indeed rising. I was surprised by the demand for electricity, as far as I know it should also be rising. Nevertheless, the data is as it is and my goal is to run PCA and clustering on it.
data_long <- pivot_longer(data, cols = -Date, names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Date, y = Value, color = Variable)) +
geom_line(size = 1) +
facet_wrap(~ Variable, scales = "free_y") +
theme_minimal() +
labs(x = "year", y = "value") +
theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Data has no NA values and has 9 columns. Next I will try to reduce the variables to a few most meaningful ones.
rownames(data) <- data$Date
data$Date <- NULL
str(data)
head(data)
dim(data)
sum(is.na(data))
## 'data.frame': 45 obs. of 9 variables:
## $ Wind_Power_Onshore : num 9748479 9811121 10019283 8426648 10595300 ...
## $ Wind_Speed_100m : num 6.28 6.29 6.38 5.86 6.58 ...
## $ Wind_Speed_10m : num 3.72 3.76 3.8 3.45 3.92 ...
## $ Total_Precipitation : num 0.736 0.885 0.891 0.583 0.719 ...
## $ Solar_PV_Power : num 1258776 1116505 1176868 1314124 1214406 ...
## $ Mean_Sea_Level_Pressure : num 101514 101506 101415 101723 101539 ...
## $ Global_Horizontal_Irradiance: num 117 108 111 121 115 ...
## $ Electricity_Demand : num 1.50e+08 1.51e+08 1.50e+08 1.50e+08 1.49e+08 ...
## $ Air_Temperature : num 280 280 281 282 282 ...
## Wind_Power_Onshore Wind_Speed_100m Wind_Speed_10m Total_Precipitation
## 1979 9748479 6.276812 3.722263 0.7362319
## 1980 9811121 6.289060 3.760425 0.8847634
## 1981 10019283 6.375627 3.801285 0.8911276
## 1982 8426648 5.855022 3.450780 0.5828524
## 1983 10595300 6.581627 3.919874 0.7189458
## 1984 9410046 6.183150 3.682740 0.7257658
## Solar_PV_Power Mean_Sea_Level_Pressure Global_Horizontal_Irradiance
## 1979 1258776 101514.1 117.4064
## 1980 1116505 101506.0 107.8236
## 1981 1176868 101414.6 110.9951
## 1982 1314124 101723.1 121.2929
## 1983 1214406 101538.8 114.9516
## 1984 1176614 101594.8 113.3527
## Electricity_Demand Air_Temperature
## 1979 150351269 280.3978
## 1980 150871484 279.6771
## 1981 149941121 280.9439
## 1982 149607568 281.5182
## 1983 149264931 282.1330
## 1984 149703257 280.9075
## [1] 45 9
## [1] 0
At the end of EDA lets check the distributions of variables.
par(mfrow = c(3, 3))
for(col in names(data)) {
hist(data[[col]], main = col, col = "lightblue", breaks = 30)
}
As data has different units, i should be normalized. PCA works on scaled data.
data.n <- scale(data)
dim(data.n)
Before PCA, I want to check if the data is clusterable at all. As you can see below, the Hopkins statistic is really bad and suggests uninformative clusters.
get_clust_tendency(data, 10, graph=T, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.4930455
##
## $plot
As PCA reduces redundancy of data, it should work the best on correlated data.
Common sense suggests that climatic variables such as temperature, pressure, and wind speed are highly correlated. However, to ensure an objective approach, I assume that I have no prior knowledge of these physical phenomena.
Below I check the correlations using corrplot() and I can see that the variables are highly correlated. Wind speed at 10m corresponds to wind speed at 100m and wind power production. Air temperature is negatively correlated with electricity demand, suggesting that in warmer years less electricity can be used for heating. Precipitation is negatively correlated with pressure.
corrplot(cor(data))
PCA is a method which reduces the data into most meaningful variables. Method provides set of uncorrelated principles components that describes given set as best as possible. The first principal component captures the most variance in the data set, the second captures the next most, and so on.
I performed analysis on my scaled numeric data.
pca<-princomp(data.n)
fviz_pca_var(pca, col.var="steelblue", repel=TRUE, ylim=c(-1,1), xlim=c(-1,1))
fviz_eig(pca)
summary(pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0422755 1.5805592 1.1271762 0.69511060 0.48261301
## Proportion of Variance 0.4739647 0.2838827 0.1443780 0.05490668 0.02646765
## Cumulative Proportion 0.4739647 0.7578473 0.9022253 0.95713201 0.98359966
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.290288594 0.225780992 0.0865864714 0.0397653930
## Proportion of Variance 0.009575849 0.005792847 0.0008519565 0.0001796916
## Cumulative Proportion 0.993175505 0.998968352 0.9998203084 1.0000000000
pca2<-prcomp(data.n)
fviz_pca_var(pca2, col.var="steelblue", repel=TRUE, ylim=c(-1,1), xlim=c(-1,1))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
fviz_eig(pca2)
summary(pca2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.065 1.5984 1.1399 0.70297 0.48807 0.29357 0.22833
## Proportion of Variance 0.474 0.2839 0.1444 0.05491 0.02647 0.00958 0.00579
## Cumulative Proportion 0.474 0.7579 0.9022 0.95713 0.98360 0.99318 0.99897
## PC8 PC9
## Standard deviation 0.08756 0.04021
## Proportion of Variance 0.00085 0.00018
## Cumulative Proportion 0.99982 1.00000
The output is large (expected). 90% of my data can be described by 3 principal components. The scree plot shows how much of the variance is explained by each component. The arrow plot visualises the dependent variables.
Now I can see the contribution of each variable between the components (graphs below). The first component depends mostly on the variable ‘the sun’, the second on demand and the third on pressure. I ran the analysis on both prcomp() and princomp() functions. As the results are not different I chose princcomp() for further analysis.
var<-get_pca_var(pca)
fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)
Let’s check the best number of clusters on the raw data using both gap statistics and silhouette.
Based on graphs and calculated Hopkin’s data are poorly clusterable. The best number of clusters is 1 or 7 (based on silhouette).
?fviz_nbclust
fviz_nbclust(data, FUNcluster = pam, method="gap_stat")
fviz_nbclust(data, FUNcluster = pam, method="silhouette")
I performed clustering on raw data and the average silhouette is moderate. I am choosing PAM in order to get representative years for each cluster.
pam_res<-eclust(data, "pam", hc_metric="euclidian",k=7)
fviz_silhouette(pam_res)
## cluster size ave.sil.width
## 1 1 7 0.29
## 2 2 9 0.35
## 3 3 10 0.52
## 4 4 8 0.49
## 5 5 6 0.44
## 6 6 1 0.00
## 7 7 4 0.26
The years which best describes each clusters are given and the mean values of cluster are calculated.
cluster | Wind_Power_Onshore | Wind_Speed_100m | Wind_Speed_10m | Total_Precipitation | Solar_PV_Power | Mean_Sea_Level_Pressure | Global_Horizontal_Irradiance | Electricity_Demand | Air_Temperature |
---|---|---|---|---|---|---|---|---|---|
1985 | 9228255 | 6.10 | 3.62 | 0.81 | 1227777 | 101541.7 | 115.77 | 150758605 | 280.26 |
1993 | 10191066 | 6.42 | 3.82 | 0.78 | 1234803 | 101549.3 | 116.71 | 149660242 | 281.52 |
2013 | 8551936 | 5.90 | 3.49 | 0.72 | 1311163 | 101617.1 | 123.12 | 149663850 | 281.88 |
2004 | 9419532 | 6.17 | 3.66 | 0.73 | 1266065 | 101600.1 | 118.78 | 149596511 | 281.57 |
2015 | 9778886 | 6.29 | 3.73 | 0.71 | 1292318 | 101583.7 | 122.55 | 148770207 | 282.89 |
1990 | 10415838 | 6.50 | 3.86 | 0.72 | 1244594 | 101569.6 | 116.99 | 147951251 | 282.27 |
2023 | 8963967 | 6.03 | 3.57 | 0.77 | 1306702 | 101548.8 | 126.06 | 148786166 | 282.91 |
I also conducted the same clustering analysis on reduced data.
pca_df <- as.data.frame(pca$scores[, 1:3])
colnames(pca_df) <- c("PCA1", "PCA2", "PCA3")
dim(pca_df)
## [1] 45 3
Reducing data gives possibility to plot the on 3d graph.
scatterplot3d(pca_df$PCA1, pca_df$PCA2, pca_df$PCA3,
color = "blue", pch = 16, main = "PCA 3D")
Now preanalysis suggests 2 clusters.
fviz_nbclust(pca_df, FUNcluster=pam, method="gap_stat")
fviz_nbclust(pca_df, FUNcluster=pam, method="silhouette")
Results are worse than on the raw data. PCA helped to visualize data but has worsen clustering evaluation.
pam_pca_res<-eclust(pca_df, "pam", hc_metric="euclidian",k=2)
fviz_silhouette(pam_pca_res)
3d visualization of clusters:
cluster | Wind_Power_Onshore | Wind_Speed_100m | Wind_Speed_10m | Total_Precipitation | Solar_PV_Power | Mean_Sea_Level_Pressure | Global_Horizontal_Irradiance | Electricity_Demand | Air_Temperature |
---|---|---|---|---|---|---|---|---|---|
1 | 9730571 | 6.27 | 3.72 | 0.78 | 1244930 | 101545.1 | 117.77 | 149568212 | 281.62 |
2 | 8681392 | 5.94 | 3.51 | 0.70 | 1321682 | 101641.2 | 124.38 | 149622252 | 281.97 |
## [1] "Medoids:"
## PCA1 PCA2 PCA3
## 1995 1.064503 0.1124647 -0.02568238
## 2005 -2.818484 -0.3508842 0.33036523
PCA allowed a reduction of variables and a better interpretation of the data, but did not improve the clustering results. The raw data were more ‘clusterable’.
It is possible to find representative years and group them by climate data, but I feel that what I have done is somehow unnecessary. It makes more sense on a larger dataset. One could add the same factors like general happiness, energy price, etc. or more technical data on grid performance.
I checked if the medoids are historically important, like if the “zima stulecia” (engl. winter of the century) or some other big catastrophe was in those years, but did not come to any conclusions.