I will analyze the secondary housing market in Warsaw using K-means and CLARA. Clustering is a data analysis technique that identifies groups of similar observations, which can then be explored further.
Data source: https://www.kaggle.com/datasets/oleksandrarsentiev/warsaw-pl-flat-prices-sept-2022. The dataset contains each apartment's total price, price per square meter, number of rooms, size in square meters, and location. Location and total price are excluded from the clustering (total price is redundant, as we already have the price per square meter and the apartment size).
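library(cluster)     # clara()
library(factoextra)  # fviz_nbclust(), fviz_cluster()
library(fpc)         # calinhara()
library(ggplot2)     # ggtitle()
library(gridExtra)   # grid.arrange(), arrangeGrob()
dane <- read.csv(file = "Warsaw_flat_prices_25_Sep_22.csv", header = TRUE, sep = ",")
dane <- unique(dane)  # drop duplicated listings
# Strip the currency suffix and thousands separators, then convert to numeric
dane$Price <- sub(" zł", "", dane$Price)
dane$Price <- as.numeric(gsub(" ", "", dane$Price))
dane <- na.omit(dane)  # prices in a foreign currency fail to parse and become NA
# Keep only the district name
dane$Location <- sub("Warszawa, ", "", dane$Location)
dane$Location <- gsub("\\,.*", "", dane$Location)
dane$Price.per.m2 <- sub(" zł/m²", "", dane$Price.per.m2)
dane$Price.per.m2 <- as.numeric(gsub(" ", "", dane$Price.per.m2))
dane$Size.M2 <- sub(" m²", "", dane$Size.M2)
dane$Size.M2 <- as.numeric(gsub(" ", "", dane$Size.M2))
# Keep only the leading number of rooms
dane$Rooms <- sub("\\ .*", "", dane$Rooms)
dane$Rooms <- as.numeric(gsub(" ", "", dane$Rooms))
dane <- na.omit(dane)  # no information about Rooms (2 flats)
dane <- dane[!(dane$Price.per.m2 == 1), ]  # outlier (data error: 660000 m^2 and 1 zł/m^2)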
summary(dane[2:5])
##      Price           Price.per.m2       Size.M2           Rooms
##  Min.   :    5900   Min.   :    76   Min.   : 10.00   Min.   : 1.000
##  1st Qu.:  550000   1st Qu.: 11343   1st Qu.: 43.00   1st Qu.: 2.000
##  Median :  695000   Median : 13333   Median : 54.00   Median : 3.000
##  Mean   :  950403   Mean   : 14441   Mean   : 62.72   Mean   : 2.631
##  3rd Qu.:  957000   3rd Qu.: 16138   3rd Qu.: 69.98   3rd Qu.: 3.000
##  Max.   :21600000   Max.   :148889   Max.   :560.00   Max.   :10.000
hist(log(dane$Rooms))
hist(log(dane$Size.M2))
hist(log(dane$Price.per.m2))
As the histograms and the summary show, the variables differ widely in scale, so they must be standardized before clustering. We will examine every pair of the three variables as well as all three together, which gives us four variants.
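To illustrate why standardization matters for distance-based clustering, here is a minimal sketch on made-up values (the `toy` data frame and its column names are hypothetical, not taken from the dataset). It uses base R's `scale()`, the same function applied to the real data later on.

```r
# Hypothetical toy data on very different scales (not taken from the dataset)
toy <- data.frame(
  price_per_m2 = c(9000, 12000, 15000, 30000),
  size_m2      = c(25, 48, 60, 120)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

# Raw Euclidean distances are dominated by the price column ...
raw_dist <- dist(toy)
# ... while after scaling both columns contribute on a comparable footing
scaled_dist <- dist(toy_scaled)

print(round(colMeans(toy_scaled), 10))  # column means after scaling
print(apply(toy_scaled, 2, sd))         # column standard deviations after scaling
```

Without this step, a variable measured in thousands (price per m²) would outweigh one measured in single digits (rooms) in every distance computation.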
We will use the K-means and CLARA algorithms for the analysis, with the latter chosen because PAM proved infeasible to compute due to the large number of observations. We will then compare the results from both methods.
First, we need to determine how many clusters to use. For this, the average silhouette width method will be used.
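As a rough sketch of what the silhouette method computes, the example below evaluates the average silhouette width for several values of k on synthetic data and picks the k with the highest value. The data, the seed, and the names `toy`, `avg_sil`, and `best_k` are assumptions made for illustration; `silhouette()` comes from the `cluster` package.

```r
library(cluster)  # silhouette()

set.seed(42)  # arbitrary seed so the sketch is reproducible
# Hypothetical data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..6; higher means better-separated clusters
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

This appears to be the same criterion that `fviz_nbclust(..., method = "s")` plots, only without the chart.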
dps <- dane[3:4]
dpsr <- dane[3:5]
dpr <- dpsr[-2]
dsr <- dpsr[-1]
d_p_s <- as.data.frame(lapply(dps, scale))
d_p_s_r <- as.data.frame(lapply(dpsr, scale))
d_p_r <- as.data.frame(lapply(dpr, scale))
d_s_r <- as.data.frame(lapply(dsr, scale))
km1s <- fviz_nbclust(d_p_s, kmeans, method = "s") + ggtitle("Price (m2) & Size (m2)")
km2s <- fviz_nbclust(d_p_s_r, kmeans, method = "s") + ggtitle("Price (m2), Size (m2) & Rooms")
km3s <- fviz_nbclust(d_p_r, kmeans,k.max = 10, method = "s") + ggtitle("Price (m2) & Rooms")
km4s <- fviz_nbclust(d_s_r, kmeans, method = "s") + ggtitle("Size (m2) & Rooms")
grid.arrange(km1s, km2s, km3s, km4s, ncol=2, top = "Optimal number of clusters - kmeans")
As seen in the charts, the most appropriate number of clusters varies between the clustered datasets. An analysis will be conducted for each of them, and an alternative variant with a different number of clusters will be created for each dataset.
set.seed(123)  # arbitrary seed so that repeated runs give the same partitions
k1 <- kmeans(d_p_s, 2)
k2 <- kmeans(d_p_s, 4)
k3 <- kmeans(d_p_r, 10)
k4 <- kmeans(d_p_r, 3)
k5 <- kmeans(d_s_r, 5)
k6 <- kmeans(d_s_r, 2)
k7 <- kmeans(d_p_s_r, 2)
k8 <- kmeans(d_p_s_r, 4)
k_1 <- fviz_cluster(list(data=dps, cluster=k1$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
k_2 <- fviz_cluster(list(data=dps, cluster=k2$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
k_3 <- fviz_cluster(list(data=dpr, cluster=k3$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
k_4 <- fviz_cluster(list(data=dpr, cluster=k4$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
k_5 <- fviz_cluster(list(data=dsr, cluster=k5$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
k_6 <- fviz_cluster(list(data=dsr, cluster=k6$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
k_7 <- fviz_cluster(list(data=dsr, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_8 <- fviz_cluster(list(data=dsr, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_9 <- fviz_cluster(list(data=dps, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_10 <- fviz_cluster(list(data=dps, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_11 <- fviz_cluster(list(data=dpr, cluster=k7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
k_12 <- fviz_cluster(list(data=dpr, cluster=k8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
grid.arrange(arrangeGrob(k_1, k_2, k_3, k_4, ncol=2, top = "K-means"))
grid.arrange(arrangeGrob(k_5, k_6, k_7, k_8, ncol=2, top = "K-means"))
grid.arrange(arrangeGrob(k_9, k_10, k_11, k_12, ncol=2, top = "K-means"))
The results are interesting. When clustering on size (m²) and price per m², the algorithm divided apartments into smaller, cheaper ones and larger, more expensive ones. The alternative variant with 4 clusters shows a similar division. “Cheap” and “expensive” refer to the price per square meter, not the total price: an apartment is “cheap” if its price per square meter is low and “expensive” otherwise.
When clustering by the number of rooms and price, more clusters are not necessarily better: the 10-cluster variant is difficult to interpret. With three clusters, a clear division emerges: cheap apartments with 1-2 rooms, cheap apartments with more rooms, and expensive apartments with any number of rooms.
For clusters based on the number of rooms and apartment size, there is a division into smaller apartments with fewer rooms and larger apartments with more rooms.
Three-dimensional clusters are hard to interpret, but their two-dimensional projections show that apartments with the same number of rooms can differ in housing standard, which is associated with their size.
Let’s now repeat the same analysis with CLARA.
clara1s <- fviz_nbclust(d_p_s, clara,k.max = 10, method = "s") + ggtitle("Price (m2) & Size (m2)")
clara2s <- fviz_nbclust(d_p_s_r, clara,k.max = 10, method = "s") + ggtitle("Price (m2), Size (m2) & Rooms")
clara3s <- fviz_nbclust(d_p_r, clara,k.max = 10, method = "s") + ggtitle("Price (m2) & Rooms")
clara4s <- fviz_nbclust(d_s_r, clara,k.max = 10, method = "s") + ggtitle("Size (m2) & Rooms")
grid.arrange(clara1s, clara2s, clara3s, clara4s, ncol=2, top = "Optimal number of clusters - clara")
In two cases, the most appropriate number of clusters differs compared to K-means. As with K-means, an analysis will be conducted for the determined number of clusters, as well as for the alternative variants.
c1 <- clara(d_p_s, 2, samples=1000)
c2 <- clara(d_p_s, 3, samples=1000)
c3 <- clara(d_p_r, 9, samples=1000)
c4 <- clara(d_p_r, 3, samples=1000)
c5 <- clara(d_s_r, 5, samples=1000)
c6 <- clara(d_s_r, 3, samples=1000)
c7 <- clara(d_p_s_r, 3, samples=1000)
c8 <- clara(d_p_s_r, 2, samples=1000)
c_1 <- fviz_cluster(list(data=dps, cluster=c1$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
c_2 <- fviz_cluster(list(data=dps, cluster=c2$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Size (m2)")
c_3 <- fviz_cluster(list(data=dpr, cluster=c3$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
c_4 <- fviz_cluster(list(data=dpr, cluster=c4$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2) & Rooms")
c_5 <- fviz_cluster(list(data=dsr, cluster=c5$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
c_6 <- fviz_cluster(list(data=dsr, cluster=c6$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Size (m2) & Rooms")
c_7 <- fviz_cluster(list(data=dsr, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_8 <- fviz_cluster(list(data=dsr, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_9 <- fviz_cluster(list(data=dps, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_10 <- fviz_cluster(list(data=dps, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_11 <- fviz_cluster(list(data=dpr, cluster=c7$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
c_12 <- fviz_cluster(list(data=dpr, cluster=c8$cluster), pointsize = 0.7, geom = "point", stand=F ) + ggtitle("Price (m2), Size (m2) & Rooms")
grid.arrange(arrangeGrob(c_1, c_2, c_3, c_4, ncol=2, top = "CLARA"))
grid.arrange(arrangeGrob(c_5, c_6, c_7, c_8, ncol=2, top = "CLARA"))
grid.arrange(arrangeGrob(c_9, c_10, c_11, c_12, ncol=2, top = "CLARA"))
In the 2-cluster CLARA variant for Price (m²) & Size (m²), the division is based solely on price. The 3-cluster variant resembles the 2-cluster K-means division, except that the smaller apartments are split into two clusters by price.
The remaining two-variable clusterings, apart from Price (m2) & Rooms with 9 clusters, divide the data much as K-means did. For the three-dimensional clusters, the conclusion remains the same as with K-means, although the clusters themselves differ somewhat.
Now, let’s use the Calinski-Harabasz index to compare the partitions. For the same data, a higher value of this index indicates a better division into clusters. We will compute it for all analyses to check whether the silhouette method identified the best number of clusters.
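For reference, the index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = [BSS / (k - 1)] / [WSS / (n - k)]. The sketch below computes it by hand on synthetic data and checks it against the sums of squares reported by `kmeans()`. The helper `ch_index()` and the data are hypothetical and written only for illustration; `calinhara()` from `fpc` should give the same quantity, though that is not verified here.

```r
set.seed(1)  # arbitrary seed
# Hypothetical data: three groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

# CH = [BSS / (k - 1)] / [WSS / (n - k)]
ch_index <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cluster))
  overall <- colMeans(x)
  wss <- 0
  bss <- 0
  for (g in unique(cluster)) {
    xg <- x[cluster == g, , drop = FALSE]
    centre <- colMeans(xg)
    wss <- wss + sum(sweep(xg, 2, centre)^2)            # spread around the cluster mean
    bss <- bss + nrow(xg) * sum((centre - overall)^2)   # spread of cluster means
  }
  (bss / (k - 1)) / (wss / (n - k))
}

fit <- kmeans(toy, centers = 3, nstart = 10)
ch_manual <- ch_index(toy, fit$cluster)
# Same quantity from the sums of squares that kmeans() reports
ch_kmeans <- (fit$betweenss / (3 - 1)) / (fit$tot.withinss / (nrow(toy) - 3))
print(c(manual = ch_manual, from_kmeans = ch_kmeans))
```

Because the index depends on the coordinates of the observations, a partition should be scored on the same (scaled) data that was used to fit it.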
# Rows: algorithm and variant; columns: variable set
k <- matrix(data = NA, nrow = 4, ncol = 4)
rownames(k) <- c("k-means", "k-means alt", "clara", "clara alt")
colnames(k) <- c("Price (m2) & Size (m2)", "Price (m2) & Rooms", "Size (m2) & Rooms", "Price, Size & Rooms")
# Each partition is scored on the scaled data it was fitted to
k[1,1] <- round(calinhara(d_p_s, k1$cluster), digits = 2)
k[2,1] <- round(calinhara(d_p_s, k2$cluster), digits = 2)
k[1,2] <- round(calinhara(d_p_r, k3$cluster), digits = 2)
k[2,2] <- round(calinhara(d_p_r, k4$cluster), digits = 2)
k[1,3] <- round(calinhara(d_s_r, k5$cluster), digits = 2)
k[2,3] <- round(calinhara(d_s_r, k6$cluster), digits = 2)
k[1,4] <- round(calinhara(d_p_s_r, k7$cluster), digits = 2)
k[2,4] <- round(calinhara(d_p_s_r, k8$cluster), digits = 2)
k[3,1] <- round(calinhara(d_p_s, c1$cluster), digits = 2)
k[4,1] <- round(calinhara(d_p_s, c2$cluster), digits = 2)
k[3,2] <- round(calinhara(d_p_r, c3$cluster), digits = 2)
k[4,2] <- round(calinhara(d_p_r, c4$cluster), digits = 2)
k[3,3] <- round(calinhara(d_s_r, c5$cluster), digits = 2)
k[4,3] <- round(calinhara(d_s_r, c6$cluster), digits = 2)
k[3,4] <- round(calinhara(d_p_s_r, c7$cluster), digits = 2)
k[4,4] <- round(calinhara(d_p_s_r, c8$cluster), digits = 2)
# The table printed below comes from the original run, which scored columns 2-4
# on other variable sets; those values will change when the code is re-run.
k
##             Price (m2) & Size (m2) Price (m2) & Rooms Size (m2) & Rooms
## k-means                    5752.96            3709.20           2451.82
## k-means alt                5322.65            4197.83           3694.40
## clara                      3166.78            2683.28           2538.24
## clara alt                  4281.90            3749.88           3744.56
##             Price, Size & Rooms
## k-means                 4842.48
## k-means alt             4718.17
## clara                   4446.24
## clara alt               4844.03
In the first case (Price (m2) & Size (m2)), K-means with the number of clusters suggested by the silhouette method had the highest index. For clustering based on three variables, the 2-cluster variants of both K-means and CLARA scored higher. In the second case (Price (m2) & Rooms), K-means with the smaller number of clusters scored higher than with the suggested number. For Size (m2) & Rooms, the alternative variants scored higher for both algorithms.
This shows that the assumptions of the algorithms matter: K-means places each cluster center at the mean of its members, which need not be an actual observation, while CLARA and PAM choose their centers (medoids) from among the existing observations. This influences the analysis outcomes, but the number of clusters matters too.
This analysis segmented the secondary housing market in Warsaw into groups using K-means and CLARA, which provided useful insights into the market. Clustering methods are well suited to segmentation problems of this kind, as they help draw conclusions about the subject being studied. Although it was challenging to distinguish strongly defined segments within the Warsaw housing market, the analysis suggests that price per square meter, apartment size, and the number of rooms together play a significant role in distinguishing housing standards.
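The difference between a mean-based center and a medoid can be seen in a tiny sketch. The data and the helper `is_row()` are hypothetical; `pam()` comes from the `cluster` package, and CLARA applies the same medoid idea to subsamples of the data.

```r
library(cluster)  # pam()

# Hypothetical single-cluster example with one extreme observation
toy <- data.frame(x = c(1, 2, 3, 4, 50), y = c(1, 2, 3, 4, 50))

centroid <- kmeans(toy, centers = 1)$centers  # mean of all points
medoid <- pam(toy, k = 1)$medoids             # an actual row of the data

# TRUE if point p coincides with some row of data
is_row <- function(p, data) any(apply(data, 1, function(r) all(r == p)))

print(centroid)
print(medoid)
print(is_row(centroid[1, ], toy))
print(is_row(medoid[1, ], toy))
```

The extreme observation pulls the mean away from the bulk of the points, whereas the medoid remains one of the observed apartments, which is one reason the two families of algorithms can partition the same data differently.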