University of Warsaw - Faculty of Economic Sciences

What inspires me to do such a project?

Unsupervised learning is a strong tool not only for analysing numerical data but also for image-based analysis, which is rare among the statistical techniques we can use. Clustering the snow-covered areas of Poland on December 31st, 2021 and January 1st, 2022 is interesting because the clusters show how the data, and thus the snow surface, changed within 24 hours. Based only on satellite pictures, we can describe changes in nature on a massive scale.

1 Data

1.1 Data used in the analysis

The data used in this analysis come from the site https://meteologix.com/dj/model-charts/swisshd-eu/poland/snow-depth.html?fbclid=IwAR2hpTKTwoVyVilgLkV1O_3QugqqVZ6yDXq6JWn_Tp2XfY-NNiKlD16MUoc, which provides weather-related data based on satellite images. I used two images: the first from December 31st, 2021 at 12:00 and the second from January 1st, 2022 at 12:00.

Libraries

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(jpeg)
library(rasterImage)
library(stringr)
library(plotly)
library(knitr)

2 Maps as data

2.1 December 31st

2.2 January 1st

However, these images are hard to cluster as they are, because they cover more than just Poland.

Therefore, all areas outside Poland are replaced with green so that they form a separate cluster. This way, the areas I am not interested in are kept apart from the snow clusters.

2.3 December 31st - adjusted

2.4 January 1st - adjusted
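
The clustering code in the following sections expects each image as a data frame with one row per pixel and the columns r.value, g.value and b.value in positions 3 to 5. The code that builds rgbImage0 and rgbImage1 is not shown in this report, so the following is only a minimal sketch of one way to do it. It assumes a column layout of x, y, r.value, g.value, b.value and uses a small synthetic array (the hypothetical names img and rgbImage) in place of the array that readJPEG() returns for a real file.

```r
# Synthetic 2 x 3 pixel image standing in for readJPEG() output:
# an array of dimension height x width x 3 with values in [0, 1].
set.seed(1)
img <- array(runif(2 * 3 * 3), dim = c(2, 3, 3))
dm <- dim(img)

# One row per pixel. as.vector() walks each colour channel column by column,
# so x is constant within an image column and y runs from top to bottom.
# The column layout (x, y, r.value, g.value, b.value) is an assumption.
rgbImage <- data.frame(
  x = rep(1:dm[2], each = dm[1]),
  y = rep(dm[1]:1, times = dm[2]),
  r.value = as.vector(img[, , 1]),
  g.value = as.vector(img[, , 2]),
  b.value = as.vector(img[, , 3])
)
```

With this layout, rgbImage[, 3:5] and rgbImage[, c("r.value", "g.value", "b.value")] select the same three columns.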

3 Clustering Algorithm: CLARA

3.1 Average silhouette width

n1<-c()                             # empty vector to save results
for (i in 1:10) {                   # number of clusters to consider
  c1<-clara(rgbImage0[, c("r.value", "g.value", "b.value")], i)
  n1[i]<-c1$silinfo$avg.width       # saving silhouette to vector
}
plot(n1, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", col="blue")
abline(h=(1:30)*5/100, lty=3, col="grey50")   # horizontal grid lines every 0.05
abline(v = 5, col='darkgreen')                # chosen number of clusters

For December 31st, the optimal choice is 5 colour clusters, which makes sense: the map shows several shades representing different amounts of snow. One cluster is reserved for the green area outside Poland.
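
The loop above records the average silhouette width for each candidate number of clusters, and the vertical line marks the chosen value. As an illustration of how that choice can be read off programmatically, here is a minimal sketch on synthetic data. The matrix pixels, the vector ks and the result best_k are hypothetical names, and the three artificial colour groups stand in for the real image; the sketch starts at 2 clusters because the silhouette is not defined for a single cluster.

```r
library(cluster)

# Synthetic "pixels": three tight, well-separated colour groups in RGB space.
set.seed(42)
centres <- c(0.1, 0.5, 0.9)
pixels <- do.call(rbind, lapply(centres, function(m) {
  matrix(rnorm(100 * 3, mean = m, sd = 0.01), ncol = 3)
}))
colnames(pixels) <- c("r.value", "g.value", "b.value")

# Average silhouette width for each candidate number of clusters.
ks <- 2:10
sil <- sapply(ks, function(k) clara(pixels, k)$silinfo$avg.width)

# Candidate number of clusters: the k with the largest average silhouette.
best_k <- ks[which.max(sil)]
```

The maximum is only a starting point; on real images the curve can be flat, and the final choice may still rest on inspecting the plot and the map.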

n1<-c()                             # empty vector to save results
for (i in 1:10) {                   # number of clusters to consider
  c1<-clara(rgbImage1[, c("r.value", "g.value", "b.value")], i)
  n1[i]<-c1$silinfo$avg.width       # saving silhouette to vector
}
plot(n1, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", col="blue")
abline(h=(1:30)*5/100, lty=3, col="grey50")   # horizontal grid lines every 0.05
abline(v = 3, col='darkgreen')                # chosen number of clusters

The new year in Poland evidently started with a warm morning, as much less snow is visible. Therefore, we can distinguish only 3 clusters, that is, 3 colours.

4 Alternative map - clusters for colours

4.1 December 31st

clara1<-clara(rgbImage0[,3:5], 5)                     # cluster the RGB columns into 5 groups
colours0<-rgb(clara1$medoids[clara1$clustering, ])    # each pixel takes the colour of its medoid
plot(rgbImage0$y~rgbImage0$x, col=colours0, pch=".", cex=2, asp=1, main="5 colours December 31st")

4.2 January 1st

clara2<-clara(rgbImage1[,3:5], 3)                     # cluster the RGB columns into 3 groups
colours1<-rgb(clara2$medoids[clara2$clustering, ])    # each pixel takes the colour of its medoid
plot(rgbImage1$y~rgbImage1$x, col=colours1, pch=".", cex=2, asp=1, main="3 colours January 1st")
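
The expression rgb(clara$medoids[clara$clustering, ]) used in both maps replaces every pixel with the colour of the medoid of its cluster. A minimal sketch on six toy pixels may make the indexing clearer; the data frame pixels and the objects fit and cols are hypothetical names used only for this illustration.

```r
library(cluster)

# Six toy pixels: three dark and three light (RGB values in [0, 1]).
pixels <- data.frame(
  r.value = c(0.10, 0.12, 0.11, 0.90, 0.92, 0.91),
  g.value = c(0.20, 0.22, 0.21, 0.80, 0.82, 0.81),
  b.value = c(0.30, 0.32, 0.31, 0.70, 0.72, 0.71)
)

fit <- clara(pixels, 2)

# fit$medoids has one row per cluster; fit$clustering holds each pixel's
# cluster number. Indexing the medoids by the cluster numbers yields one
# medoid row per pixel, which rgb() converts into a hex colour string.
cols <- rgb(fit$medoids[fit$clustering, ])
```

The result has one colour string per pixel but only as many distinct values as there are clusters, which is why the recoloured maps show 5 and 3 colours.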

5 Colours palette

5.1 Unique Colours

December 31st

rgb_code1 colour_name1
#23B14D green
#AAA9C9 faded blue
#F8F8FA white
#3C92E7 blue
#75BAFF purple

January 1st

rgb_code2 colour_name2
#23B14D green
#A9A8C7 faded blue
#F5F5F7 white

5.2 One colour for snow

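colours1<-rgb(clara1$medoids[clara1$clustering, ])         # December 31st colours, one per pixel
colours1 <- str_replace_all(colours1,"#AAA9C9","#3C92E7")  # merge the first extra snow shade into blue
colours1 <- str_replace_all(colours1,"#75BAFF","#3C92E7")  # merge the second extra snow shade into blue

# remaining colours: green (outside Poland), white, and a single blue for all snow shades
plot(rgbImage0$y~rgbImage0$x, col=colours1, pch=".", cex=2, asp=1, main="3 different colours")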

6 Conclusion: Change of snow area

Visualisation is always worthwhile. Beyond that, the aim of this analysis is to measure how much less area (or more, although the maps suggest less) is covered by snow after only 24 hours.

6.1 Visualisation

December 31st (Freq in %)

Colour Freq
#3C92E7 92.24624
#F8F8FA 7.75376

January 1st (Freq in %)

Colour1 Freq
#A9A8C7 38.09902
#F5F5F7 61.90098
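
The code behind these frequency tables is not shown in this report. A minimal sketch of how such percentages can be computed from a per-pixel colour vector follows, assuming the green pixels outside Poland are dropped first. The vector colours and its counts are made up for illustration, and inside, share and share_df are hypothetical names.

```r
# Hypothetical per-pixel colour vector: green marks the area outside Poland.
colours <- c(rep("#23B14D", 50), rep("#3C92E7", 30), rep("#F8F8FA", 10))

# Keep only the pixels inside Poland, then convert counts to percentages.
inside <- colours[colours != "#23B14D"]
share <- prop.table(table(inside)) * 100

# Same shape as the tables above: one row per colour with its share in %.
share_df <- data.frame(Colour = names(share), Freq = as.numeric(share))
```

Because the green pixels are removed before the shares are computed, the percentages describe the area inside Poland only and sum to 100.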

The clustering shows that the snow-covered area in Poland dropped from 92.2% on December 31st at 12:00 to 38.1% on January 1st at 12:00, a fall of about 54 percentage points. It is only 24 hours, but the effect is very surprising! Definitely a warm start to 2022!