I will analyze the secondary housing market in Warsaw using K-means and CLARA. Clustering groups similar observations together, and the resulting groups can then be explored further.

Data source: https://www.kaggle.com/datasets/oleksandrarsentiev/warsaw-pl-flat-prices-sept-2022. The dataset contains apartment prices, price per square meter, the number of rooms, apartment size in square meters, and location. Location and total price are left out of the clustering; total price would be redundant, as we already have the price per square meter and the apartment size.

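library(cluster)     # clara()
library(factoextra)  # fviz_nbclust(), fviz_cluster()
library(fpc)         # calinhara()
library(gridExtra)   # grid.arrange(), arrangeGrob()
library(ggplot2)     # ggtitle()

dane = read.csv(file = "Warsaw_flat_prices_25_Sep_22.csv", header = TRUE,sep=",")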
dane = unique(dane)
dane$Price = sub(" zł", "",dane$Price)
dane$Price = as.numeric(gsub(" ", "",dane$Price))
dane <- na.omit(dane) # drop listings priced in a foreign currency (parsed as NA above)
dane$Location = sub("Warszawa, ", "",dane$Location)
dane$Location = gsub("\\,.*", "", dane$Location)
dane$Price.per.m2 = sub(" zł/m²", "",dane$Price.per.m2)
dane$Price.per.m2 = as.numeric(gsub(" ", "",dane$Price.per.m2))
dane$Size.M2= sub(" m²", "",dane$Size.M2)
dane$Size.M2 = as.numeric(gsub(" ", "",dane$Size.M2))
dane$Rooms= sub("\\ .*", "",dane$Rooms)
dane$Rooms = as.numeric(gsub(" ", "",dane$Rooms))
dane <- na.omit(dane) # drop listings with no information about Rooms (2 flats)
dane <- dane[!(dane$Price.per.m2==1),] # drop a data-entry error (660000 m^2 at 1 zł/m^2)
summary(dane[2:5])
##      Price           Price.per.m2       Size.M2           Rooms       
##  Min.   :    5900   Min.   :    76   Min.   : 10.00   Min.   : 1.000  
##  1st Qu.:  550000   1st Qu.: 11343   1st Qu.: 43.00   1st Qu.: 2.000  
##  Median :  695000   Median : 13333   Median : 54.00   Median : 3.000  
##  Mean   :  950403   Mean   : 14441   Mean   : 62.72   Mean   : 2.631  
##  3rd Qu.:  957000   3rd Qu.: 16138   3rd Qu.: 69.98   3rd Qu.: 3.000  
##  Max.   :21600000   Max.   :148889   Max.   :560.00   Max.   :10.000
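The cleaning steps above repeat one pattern: strip a unit suffix, remove the spaces used as thousands separators, and convert to numeric, so that anything left over (such as a foreign currency code) becomes NA. A minimal sketch of that pattern on made-up strings follows; `parse_amount` and the toy values are hypothetical and not part of the original script.

```r
# Hypothetical helper (not in the original script): strip a unit suffix and
# the spaces used as thousands separators, then convert to numeric.
parse_amount <- function(x, unit) {
  x <- sub(unit, "", x, fixed = TRUE)   # drop the unit, e.g. the currency
  x <- gsub("[[:space:]]", "", x)       # drop thousands separators
  suppressWarnings(as.numeric(x))       # non-numeric leftovers become NA
}

zl <- "z\u0142"  # the currency suffix used in the dataset, written with an escape
toy <- c(paste("695 000", zl), paste("1 250 000", zl), "150 000 EUR")

prices <- parse_amount(toy, zl)
prices_clean <- prices[!is.na(prices)]  # same effect as na.omit() on this vector
print(prices)
```

The listing priced in EUR cannot be converted and is dropped, which mirrors the first `na.omit()` call in the script.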
hist(log(dane$Rooms))

hist(log(dane$Size.M2))

hist(log(dane$Price.per.m2))

As the summary and histograms show, the variables differ greatly in scale, so they need to be standardized before clustering. We will examine the three pairs of variables and the full set of three, which gives us four variants.
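To illustrate why standardization matters for distance-based clustering, here is a small sketch on invented values (the `toy` data frame is an assumption, not taken from the dataset). `scale()` converts each column to z-scores with mean 0 and standard deviation 1, so no single variable dominates the Euclidean distance.

```r
# Toy data with the same kinds of columns as the report (values are invented)
toy <- data.frame(
  price_per_m2 = c(9000, 12000, 15000, 30000),
  size_m2      = c(30, 55, 70, 120),
  rooms        = c(1, 2, 3, 5)
)

z <- as.data.frame(scale(toy))  # z-scores: (x - mean) / sd, column by column

col_means <- colMeans(z)        # approximately 0 for every column
col_sds   <- sapply(z, sd)      # exactly 1 up to rounding error

# Distances on raw data are driven almost entirely by price_per_m2;
# after scaling, all three variables contribute on a comparable footing.
d_raw <- dist(toy)
d_std <- dist(z)
print(round(as.matrix(d_std), 2))
```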

We will use the K-means and CLARA algorithms for the analysis, with the latter chosen because PAM proved infeasible to compute due to the large number of observations. We will then compare the results from both methods.

k-means

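First, we need to determine how many clusters to use. For this, the average silhouette width is computed for each candidate number of clusters (method = "s" in fviz_nbclust), and the number with the highest value is preferred.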
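As a rough illustration of what the silhouette criterion computes, the sketch below evaluates the average silhouette width by hand for several values of k. It uses simulated data with two well-separated groups (an assumption made for the example) and `silhouette()` from the cluster package rather than `fviz_nbclust`.

```r
library(cluster)  # silhouette()

set.seed(1)
# Simulated data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
d <- dist(x)

# Average silhouette width for k = 2..5
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:5

best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

On data like this the criterion picks the number of groups that was simulated; on real data the curve is often flatter, which is one reason to also inspect an alternative k.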

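# Raw variable sets (columns 3-5 are Price.per.m2, Size.M2 and Rooms)
dps <- dane[3:4]    # price per m2 & size
dpsr <- dane[3:5]   # price per m2, size & rooms
dpr <- dpsr[-2]     # price per m2 & rooms
dsr <- dpsr[-1]     # size & rooms
# Standardized versions used for clustering
d_p_s <- as.data.frame(lapply(dps, scale))
d_p_s_r <- as.data.frame(lapply(dpsr, scale))
d_p_r <- as.data.frame(lapply(dpr, scale))
d_s_r <- as.data.frame(lapply(dsr, scale))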
km1s <- fviz_nbclust(d_p_s, kmeans, method = "s") + ggtitle("Price (m2) & Size (m2)")
km2s <- fviz_nbclust(d_p_s_r, kmeans, method = "s") + ggtitle("Price (m2), Size (m2) & Rooms")
km3s <- fviz_nbclust(d_p_r, kmeans,k.max = 10, method = "s") + ggtitle("Price (m2) & Rooms")
km4s <- fviz_nbclust(d_s_r, kmeans, method = "s") + ggtitle("Size (m2) & Rooms")
grid.arrange(km1s, km2s, km3s, km4s, ncol=2, top = "Optimal number of clusters - kmeans")

As seen in the charts, the most appropriate number of clusters varies between the clustered datasets. An analysis will be conducted for each of them, and an alternative variant with a different number of clusters will be created for each dataset.

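# Odd-numbered fits use the silhouette-suggested k; even-numbered fits are the alternatives
k1 <- kmeans(d_p_s, 2)     # Price (m2) & Size (m2)
k2 <- kmeans(d_p_s, 4)
k3 <- kmeans(d_p_r, 10)    # Price (m2) & Rooms
k4 <- kmeans(d_p_r, 3)
k5 <- kmeans(d_s_r, 5)     # Size (m2) & Rooms
k6 <- kmeans(d_s_r, 2)
k7 <- kmeans(d_p_s_r, 2)   # Price (m2), Size (m2) & Rooms
k8 <- kmeans(d_p_s_r, 4)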
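K-means starts from randomly chosen centers, so repeated runs can produce different partitions. A minimal sketch of one common way to make such fits reproducible is shown below, on simulated data; the seed value and `nstart = 25` are arbitrary choices for the example, not settings used in this report.

```r
set.seed(42)
# Simulated data: two groups of 30 points each
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

# Fixing the seed makes the random starts, and hence the fit, repeatable;
# nstart > 1 keeps the best of several random starts.
set.seed(123)
fit_a <- kmeans(x, centers = 2, nstart = 25)
set.seed(123)
fit_b <- kmeans(x, centers = 2, nstart = 25)

same_partition <- identical(fit_a$cluster, fit_b$cluster)
print(same_partition)
```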

k_1 <- fviz_cluster(list(data=dps, cluster=k1$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
k_2 <- fviz_cluster(list(data=dps, cluster=k2$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
k_3 <- fviz_cluster(list(data=dpr, cluster=k3$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
k_4 <- fviz_cluster(list(data=dpr, cluster=k4$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")

k_5 <- fviz_cluster(list(data=dsr, cluster=k5$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
k_6 <- fviz_cluster(list(data=dsr, cluster=k6$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
k_7 <- fviz_cluster(list(data=dsr, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_8 <- fviz_cluster(list(data=dsr, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")

k_9 <- fviz_cluster(list(data=dps, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_10 <- fviz_cluster(list(data=dps, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_11 <- fviz_cluster(list(data=dpr, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_12 <- fviz_cluster(list(data=dpr, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")

grid.arrange(arrangeGrob(k_1, k_2, k_3, k_4, ncol=2, top = "K-means"))

grid.arrange(arrangeGrob(k_5, k_6, k_7, k_8, ncol=2, top = "K-means"))

grid.arrange(arrangeGrob(k_9, k_10, k_11, k_12, ncol=2, top = "K-means"))

The results are interesting. Throughout, “cheap” and “expensive” refer to the price per square meter: for simplicity, I will call an apartment “cheap” if its price per square meter is low and “expensive” otherwise. When clustering on size (m²) and price per m², the algorithm divided apartments into smaller and cheaper versus larger and more expensive ones. The alternative variant with 4 clusters shows a similar division.

When clustering by the number of rooms and price, more clusters are not necessarily better: the 10-cluster variant is difficult to interpret. For three clusters, a clear division emerges: small apartments (1-2 rooms, cheap), cheaper apartments with more rooms, and more expensive apartments with any number of rooms.

For clusters based on the number of rooms and apartment size, there is a division into smaller apartments with fewer rooms and larger apartments with more rooms.

Three-dimensional clusters are harder to interpret, but their projections onto pairs of variables show that apartments with the same number of rooms can differ in housing standard, which is associated with their size.

Let’s now apply the same analytical methods to CLARA.

CLARA

clara1s <- fviz_nbclust(d_p_s, clara,k.max = 10, method = "s") + ggtitle("Price (m2) & Size (m2)")
clara2s <- fviz_nbclust(d_p_s_r, clara,k.max = 10, method = "s") + ggtitle("Price (m2), Size (m2) & Rooms")
clara3s <- fviz_nbclust(d_p_r, clara,k.max = 10, method = "s") + ggtitle("Price (m2) & Rooms")
clara4s <- fviz_nbclust(d_s_r, clara,k.max = 10, method = "s") + ggtitle("Size (m2) & Rooms")
grid.arrange(clara1s, clara2s, clara3s, clara4s, ncol=2, top = "Optimal number of clusters - clara")

In two cases, the most appropriate number of clusters differs compared to K-means. As with K-means, an analysis will be conducted for the determined number of clusters, as well as for the alternative variants.

c1 <- clara(d_p_s, 2, samples=1000)
c2 <- clara(d_p_s, 3, samples=1000)
c3 <- clara(d_p_r, 9, samples=1000)
c4 <- clara(d_p_r, 3, samples=1000)
c5 <- clara(d_s_r, 5, samples=1000)
c6 <- clara(d_s_r, 3, samples=1000)
c7 <- clara(d_p_s_r, 3, samples=1000)
c8 <- clara(d_p_s_r, 2, samples=1000)

c_1 <- fviz_cluster(list(data=dps, cluster=c1$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
c_2 <- fviz_cluster(list(data=dps, cluster=c2$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
c_3 <- fviz_cluster(list(data=dpr, cluster=c3$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
c_4 <- fviz_cluster(list(data=dpr, cluster=c4$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")

c_5 <- fviz_cluster(list(data=dsr, cluster=c5$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
c_6 <- fviz_cluster(list(data=dsr, cluster=c6$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
c_7 <- fviz_cluster(list(data=dsr, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_8 <- fviz_cluster(list(data=dsr, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")

c_9 <- fviz_cluster(list(data=dps, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_10 <- fviz_cluster(list(data=dps, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_11 <- fviz_cluster(list(data=dpr, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_12 <- fviz_cluster(list(data=dpr, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")

grid.arrange(arrangeGrob(c_1, c_2, c_3, c_4, ncol=2, top = "CLARA"))

grid.arrange(arrangeGrob(c_5, c_6, c_7, c_8, ncol=2, top = "CLARA"))

grid.arrange(arrangeGrob(c_9, c_10, c_11, c_12, ncol=2, top = "CLARA"))

In the 2-cluster variant of CLARA for Price (m²) & Size (m²), the division is based solely on price. The 3-cluster variant resembles the 2-cluster K-means division, except that the smaller apartments are further split into two clusters by price.

The other two-variable clusterings, apart from Price (m2) & Rooms with 9 clusters, divide the data similarly to their K-means counterparts. For three-dimensional clusters, the conclusion is the same as with K-means, although the clusters themselves differ somewhat.

Post-diagnostics

Now, let’s use the Calinski-Harabasz index to determine which divisions are better than others. The higher the value of this index for the same data, the better the division into clusters. We will check this for all analyses to see if the pre-diagnostics identified the optimal number of clusters.
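For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (B / (k - 1)) / (W / (n - k)). The sketch below computes it by hand in base R on simulated data and checks it against the sums of squares that `kmeans` reports; `ch_index` is a hypothetical helper written for this illustration, not the `calinhara` function from fpc.

```r
# Hypothetical helper: Calinski-Harabasz index for data x and cluster labels cl
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  total_ss  <- sum(scale(x, scale = FALSE)^2)           # dispersion around the grand mean
  within_ss <- sum(sapply(split(as.data.frame(x), cl),  # dispersion around each cluster mean
                          function(g) sum(scale(g, scale = FALSE)^2)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

set.seed(10)
x <- rbind(matrix(rnorm(80, mean = 0), ncol = 2),
           matrix(rnorm(80, mean = 4), ncol = 2),
           matrix(rnorm(80, mean = 8), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 10)

ch_manual <- ch_index(x, km$cluster)
ch_kmeans <- (km$betweenss / (3 - 1)) / (km$tot.withinss / (nrow(x) - 3))
print(c(manual = ch_manual, from_kmeans = ch_kmeans))
```

Because the index depends on the data as well as on the labels, a partition should be scored against the same (scaled) variables it was fitted on.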

k <- matrix(data = NA, nrow = 4, ncol = 4, byrow = FALSE,
       dimnames = NULL)
rownames(k) <- c("k-mean", "k-means alt", "clara", "clara alt")
colnames(k) <- c("Price (m2) & Size (m2)", "Price (m2) & Rooms","Size (m2) & Rooms", "Price, Size & Rooms")

# Each index is evaluated on the scaled data its clustering was fitted on;
# scoring a partition against a different variable set gives different values.
k[1,1] = round(calinhara(d_p_s, k1$cluster),digits=2)
k[2,1] = round(calinhara(d_p_s, k2$cluster),digits=2)
k[1,2] = round(calinhara(d_p_r, k3$cluster),digits=2)
k[2,2] = round(calinhara(d_p_r, k4$cluster),digits=2)
k[1,3] = round(calinhara(d_s_r, k5$cluster),digits=2)
k[2,3] = round(calinhara(d_s_r, k6$cluster),digits=2)
k[1,4] = round(calinhara(d_p_s_r, k7$cluster),digits=2)
k[2,4] = round(calinhara(d_p_s_r, k8$cluster),digits=2)

# clara() stores its labels in $clustering
k[3,1] = round(calinhara(d_p_s, c1$clustering),digits=2)
k[4,1] = round(calinhara(d_p_s, c2$clustering),digits=2)
k[3,2] = round(calinhara(d_p_r, c3$clustering),digits=2)
k[4,2] = round(calinhara(d_p_r, c4$clustering),digits=2)
k[3,3] = round(calinhara(d_s_r, c5$clustering),digits=2)
k[4,3] = round(calinhara(d_s_r, c6$clustering),digits=2)
k[3,4] = round(calinhara(d_p_s_r, c7$clustering),digits=2)
k[4,4] = round(calinhara(d_p_s_r, c8$clustering),digits=2)

k
##             Price (m2) & Size (m2) Price (m2) & Rooms Size (m2) & Rooms
## k-mean                     5752.96            3709.20           2451.82
## k-means alt                5322.65            4197.83           3694.40
## clara                      3166.78            2683.28           2538.24
## clara alt                  4281.90            3749.88           3744.56
##             Price, Size & Rooms
## k-mean                  4842.48
## k-means alt             4718.17
## clara                   4446.24
## clara alt               4844.03

For Price (m2) & Size (m2), K-means with the silhouette-suggested number of clusters (2) has the highest index in the table. For Price (m2) & Rooms, K-means with 3 clusters scores higher than with the silhouette-suggested 10. For Size (m2) & Rooms, the alternative variants score higher for both algorithms. For the three-variable clustering, the 2-cluster variants of K-means and CLARA score higher than the variants with more clusters.

This shows that the assumptions of the algorithms matter: K-means places each cluster center at the mean of its members, which need not be an actual observation, while CLARA and PAM use medoids, which are always actual observations. This difference influences the results, and so does the chosen number of clusters.
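The difference between means and medoids can be checked directly. The sketch below, on simulated data (an assumption for the example), fits both algorithms and tests whether each cluster representative coincides with a row of the data; `samples = 50` is an arbitrary choice.

```r
library(cluster)  # clara()

set.seed(7)
# Simulated data: two groups of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)
cl <- clara(x, k = 2, samples = 50)

# CLARA medoids are rows of x; i.med holds their row indices
medoid_is_row <- all(x[cl$i.med, , drop = FALSE] == cl$medoids)

# K-means centers are cluster means, which for continuous data
# essentially never coincide with an observation
center_is_row <- any(apply(km$centers, 1, function(ctr) {
  any(colSums(t(x) == ctr) == ncol(x))
}))

print(c(medoid_is_row = medoid_is_row, center_is_row = center_is_row))
```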

Summary

K-means and CLARA were used to segment the secondary housing market in Warsaw into groups. Clustering methods are useful for segmentation in many contexts because they help draw conclusions about the subject being studied. Although strongly defined segments were hard to distinguish within the Warsaw housing market, the analysis suggests that price per square meter, apartment size, and the number of rooms play a significant role in differentiating housing standards.