The aim of this paper is to divide the properties into clusters. Clustering methods yield groups of objects that are similar within a group and, as expected, dissimilar between groups. Clustering assists in market segmentation, dividing the overall property market into distinct groups. This can be valuable for targeted marketing strategies, as different clusters may have varying needs, preferences, and price sensitivities. Clustering can also be used for assessing risk: properties within the same cluster are likely to share similar risks and vulnerabilities. This information is valuable for investors, allowing them to diversify their portfolios and manage risk more effectively. Finally, clustering allows for a deeper understanding of the competitive landscape, helping property professionals make informed decisions about pricing, marketing, and overall strategy.
The medoid of a cluster is defined as the object in the cluster whose sum (and, equivalently, average) of dissimilarities to all the other objects in the cluster is minimal; that is, it is the most centrally located point in the cluster.
The centroid of a cluster is defined as the mean point of the cluster and represents its center. It need not be a member of the data set.
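Formally, for a cluster \(C\) with dissimilarity measure \(d\), the medoid is \(m = \arg\min_{x_i \in C} \sum_{x_j \in C} d(x_i, x_j)\), while the centroid is the arithmetic mean \(\mu = \frac{1}{|C|} \sum_{x_j \in C} x_j\).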
K-means clustering is a popular unsupervised machine learning algorithm. The aim of the clustering is to partition n observations into k clusters. To perform this algorithm we have to follow the steps below:
- Specify the number of centroids
- Assign each observation to the cluster with the nearest mean
- Recalculate the centroids from the observations assigned to each cluster
The last two steps are repeated; the algorithm has converged when the assignments no longer change. It is important to point out that the algorithm is not guaranteed to find the global optimum. K-means assigns observations to clusters by (squared) Euclidean distance; using a different distance function (e.g. Manhattan, Minkowski, Hamming, …) may prevent the algorithm from converging.
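As a minimal illustration of these steps in R (a toy sketch on the built-in mtcars data, not the properties data analyzed later):
set.seed(123)                                     # k-means initialization is random
toy <- as.data.frame(scale(mtcars))               # standardize so no variable dominates the distance
km_toy <- kmeans(toy, centers = 3, nstart = 25)   # nstart reruns mitigate local optima
km_toy$centers                                    # centroid coordinates (not necessarily data points)
table(km_toy$cluster)                             # cluster sizes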
PAM (Partitioning Around Medoids) is an unsupervised clustering algorithm similar to k-means. In contrast to the k-means algorithm, k-medoids chooses medoids as centers (in k-means we use centroids), which allows for better interpretation of the cluster centers than in k-means, where the center of a cluster is not necessarily one of the input data points. The main disadvantage of this algorithm is that the number of clusters must be known a priori. To perform this algorithm we have to follow the steps below:
- Specify the number of medoids
- Associate each observation with the closest medoid
- For each medoid, consider swapping it with a non-medoid point. If the swap decreases the within-cluster sum of squares, perform it.
- Reassign each point to the cluster defined by the closest medoid
determined in the previous step
- Repeat the two steps above until no swap of a medoid with a non-medoid point lowers the within-cluster sum of squares.
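A minimal sketch of PAM in R with the cluster package (again on a toy built-in data set):
set.seed(123)
toy <- as.data.frame(scale(mtcars))
pam_toy <- cluster::pam(toy, k = 3, metric = "euclidean")
pam_toy$medoids             # medoids are actual observations from the data
table(pam_toy$clustering)   # cluster sizes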
Hierarchical clustering
It is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:
- Agglomerative (bottom-up) - each observation starts as its own cluster, and pairs of the most similar clusters are merged until one cluster contains all observations from the initial data set.
- Divisive (top-down) - the reverse of agglomerative: it starts with one cluster containing all observations and ends with each observation as its own cluster.
Results of hierarchical clustering are usually presented in a dendrogram.
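A minimal sketch of agglomerative hierarchical clustering in R (divisive clustering is available e.g. via cluster::diana):
toy_dist <- dist(scale(mtcars))                   # Euclidean dissimilarity matrix
hc_toy <- hclust(toy_dist, method = "ward.D2")    # agglomerative clustering with Ward linkage
plot(hc_toy)                                      # dendrogram of the merge hierarchy
cutree(hc_toy, k = 2)                             # cut the dendrogram into 2 clusters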
Silhouette width
The silhouette value is a measure of how similar an object is to its own
cluster compared to other clusters. The silhouette ranges from −1 to +1,
where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighboring clusters. If most objects have
a high value, then the clustering configuration is appropriate. If many
points have a low or negative value, then the clustering configuration
may have too many or too few clusters. A clustering with an average
silhouette width of over 0.7 is considered to be “strong”, a value over
0.5 “reasonable”, and over 0.25 “weak”. A negative silhouette value indicates a poor assignment: the observation most likely belongs to another cluster. With increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.
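For completeness, the silhouette of observation \(i\) is defined as \(s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\), where \(a(i)\) is the mean dissimilarity of \(i\) to the other members of its own cluster and \(b(i)\) is the smallest mean dissimilarity of \(i\) to the members of any other cluster.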
The Duda-Hart index is a test of clustering quality. It is well defined for the k-means setting, and its hypotheses are as follows:
- \(H_0\): homogeneity of the cluster (the data within the cluster are similar)
- \(H_1\): heterogeneity of the cluster (the cluster can easily be split)
The test statistic \(dh\) is the ratio of the within-cluster sum of squares when the cluster is split into two clusters to the overall sum of squares of the undivided cluster.
Verification of the hypotheses: cluster1=FALSE means that \(H_0\) of homogeneity is rejected and \(H_1\) is accepted.
The Calinski-Harabasz index is a metric for evaluating clustering algorithms. It is constructed as a fraction with the between-group sum of squares in the numerator and the within-cluster sum of squares (the sum of the within-cluster dispersions over all clusters) in the denominator. From this construction we can see that the higher the statistic, the better: it means the between-group sum of squares is higher, the within-cluster sum of squares is lower, or both. The statistic is usually used for comparing solutions with alternative numbers of clusters rather than different algorithms.
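In symbols, for \(n\) observations grouped into \(k\) clusters, \(CH = \frac{BGSS/(k-1)}{WGSS/(n-k)}\), where \(BGSS\) is the between-group sum of squares and \(WGSS\) is the within-cluster sum of squares.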
Linear regression
Linear regression is used to predict the value of an outcome variable
\(Y\) based on one or more input
predictor variables \(X\). The aim is to establish a linear relationship between the dependent variable and the independent variables, so that we can use this relationship to estimate the value of the response \(Y\) when only the predictor values \(X\) are known. In the simplest model, with one variable, the formula is \(Y = β_1 + β_2X + ϵ\), where \(β_1\) is the intercept and \(β_2\) is the slope. Collectively, they are called regression coefficients. \(ϵ\) is the error term, the part of \(Y\) the regression model is unable to explain.
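With more than one predictor the same idea generalizes to \(Y = Xβ + ϵ\); the coefficients are estimated by ordinary least squares, which minimizes the sum of squared residuals and yields \(\hat{β} = (X^{T}X)^{-1}X^{T}y\).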
The data set used in this paper was sourced from the morizon.pl website using web scraping. This study focuses on modeling apartment prices in Warsaw that an average person could afford. Therefore, certain constraints were imposed on the data set while scraping. The acquired data satisfy the following conditions:
- The price per square meter of the apartment is not greater than 15,000
PLN/\(m^2\)
- The area of the apartment does not exceed 100 \(m^2\)
Number of observations: 2111
Number of variables: 11
Number of numeric variables: 9
List of attributes with explanation
| Variable name | Variable explanation |
|---|---|
| Location | location of the property (1 out of 18 districts) |
| Market | aftermarket/primary market |
| Price_m2 | price of \(m^2\) in zl |
| Surface | surface of the property in \(m^2\) |
| Rooms | number of rooms |
| If_Balcony | 1 means that the property has a balcony, 0 otherwise |
| Level | the floor on which the property is situated |
| Built_year | year of construction |
| If_elevator | 1 means that the property has access to an elevator, 0 otherwise |
| If_parking | 1 means that the property has access to an individual parking space, 0 otherwise |
| if_gated_community | 1 means that the property is situated in a gated community, 0 otherwise |
List of packages necessary to run the code
set.seed(123)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Warning: package 'flexclust' was built under R version 4.3.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
## Warning: package 'fpc' was built under R version 4.3.2
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.2
library(ClusterR)
## Warning: package 'ClusterR' was built under R version 4.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dendextend)
## Warning: package 'dendextend' was built under R version 4.3.2
##
## Attaching package: 'dendextend'
##
## The following object is masked from 'package:stats':
##
## cutree
library(readxl)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
properties <- read_xlsx("C:/Users/User/Desktop/Studia/Praca licencjacka z ekonomii/Licencjat Ekonomia/Properties.xlsx")
# Check the first values of every variable
glimpse(properties)
## Rows: 2,111
## Columns: 11
## $ Location <chr> "Ursynow", "srodmiescie", "Bielany", "srodmiescie",…
## $ Market <chr> "aftermarket", "aftermarket", "aftermarket", "after…
## $ Price_m2 <dbl> 15000.00, 15000.00, 15000.00, 15000.00, 15000.00, 1…
## $ Surface <dbl> 33.00, 34.68, 46.00, 50.00, 51.60, 54.00, 59.00, 60…
## $ Rooms <chr> "1", "1", "2", "3", "2", "3", "3", "2", "3", "2", "…
## $ If_Balcony <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, …
## $ Level <dbl> 0, 3, 6, 1, 2, 0, 5, 6, 2, 0, 5, 1, 1, 2, 1, 0, 2, …
## $ Built_year <chr> "2020", "1959", "2008", "1952", "2022", "1960", "19…
## $ If_elevator <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ If_parking <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ if_gated_community <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
properties$Location<-str_to_title(properties$Location)
# Variables Rooms and Built_year should be numeric
str(properties)
## tibble [2,111 × 11] (S3: tbl_df/tbl/data.frame)
## $ Location : chr [1:2111] "Ursynow" "Srodmiescie" "Bielany" "Srodmiescie" ...
## $ Market : chr [1:2111] "aftermarket" "aftermarket" "aftermarket" "aftermarket" ...
## $ Price_m2 : num [1:2111] 15000 15000 15000 15000 15000 15000 15000 15000 15000 15000 ...
## $ Surface : num [1:2111] 33 34.7 46 50 51.6 ...
## $ Rooms : chr [1:2111] "1" "1" "2" "3" ...
## $ If_Balcony : num [1:2111] 0 0 1 0 0 0 0 0 0 0 ...
## $ Level : num [1:2111] 0 3 6 1 2 0 5 6 2 0 ...
## $ Built_year : chr [1:2111] "2020" "1959" "2008" "1952" ...
## $ If_elevator : num [1:2111] 0 0 0 0 0 0 1 1 0 0 ...
## $ If_parking : num [1:2111] 0 0 0 0 0 0 0 1 0 0 ...
## $ if_gated_community: num [1:2111] 0 0 0 0 0 0 0 0 0 0 ...
# Converting Variables Rooms and Built_year to numeric
properties$Rooms <- as.numeric(properties$Rooms)
properties$Built_year <- as.numeric(properties$Built_year)
# There are no missing values
properties[!complete.cases(properties), ]
## # A tibble: 0 × 11
## # ℹ 11 variables: Location <chr>, Market <chr>, Price_m2 <dbl>, Surface <dbl>,
## # Rooms <dbl>, If_Balcony <dbl>, Level <dbl>, Built_year <dbl>,
## # If_elevator <dbl>, If_parking <dbl>, if_gated_community <dbl>
aggregation <- aggregate(properties$Price_m2 ~ properties$Rooms, FUN = median)
colnames(aggregation) <- c("Number of rooms", "Median of 1m2 price")
aggregation
## Number of rooms Median of 1m2 price
## 1 1 12987.78
## 2 2 12601.94
## 3 3 11661.81
## 4 4 10846.34
## 5 5 10791.33
# Overview statistics of the data
summary(properties)
## Location Market Price_m2 Surface
## Length:2111 Length:2111 Min. : 5583 Min. :13.83
## Class :character Class :character 1st Qu.:10595 1st Qu.:46.00
## Mode :character Mode :character Median :11965 Median :56.00
## Mean :11924 Mean :57.13
## 3rd Qu.:13386 3rd Qu.:67.40
## Max. :15000 Max. :99.70
## Rooms If_Balcony Level Built_year
## Min. :1.000 Min. :0.0000 Min. :-1.000 Min. :1891
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:1975
## Median :3.000 Median :1.0000 Median : 2.000 Median :2000
## Mean :2.666 Mean :0.5623 Mean : 2.793 Mean :1994
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.: 4.000 3rd Qu.:2018
## Max. :5.000 Max. :1.0000 Max. :18.000 Max. :2025
## If_elevator If_parking if_gated_community
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.2729 Mean :0.08716 Mean :0.06585
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
# Aftermarket properties account for a much greater share of the data set
table(properties$Market)
##
## aftermarket primary_market
## 1840 271
table(properties$Location, properties$Market)
##
## aftermarket primary_market
## Bemowo 169 28
## Bialolęka 197 28
## Bielany 132 43
## Mokotow 72 1
## Ochota 97 0
## Pragapolnoc 64 1
## Pragapoludnie 163 8
## Rembertow 58 24
## Srodmiescie 38 0
## Targowek 118 22
## Ursus 133 16
## Ursynow 190 3
## Wawer 115 39
## Wesola 44 14
## Wilanow 57 11
## Wlochy 33 22
## Wola 86 11
## Zoliborz 74 0
# Number of properties in the data set for each location, split by market type
ggplot(data = properties, aes(y = Location, fill = Market)) +
geom_bar() +
ggtitle("Number of properties in the dataset for different locations in division to market type") +
xlab("Count") +
theme(plot.title = element_text(hjust = 0.5))
# Subsetting all numeric variables
properties_numeric <- properties[,3:11]
# Z-score standardization of the data
properties_numeric_z <- as.data.frame(lapply(properties_numeric, scale))
The next step is a correlation analysis. It is important to see whether the variables are correlated, because high correlation between variables might bias further analysis. From the plot we observe that only Surface and Rooms are highly linearly correlated (0.79). In our case we analyze the data set with all variables included, as this report extends other analyses of the same data. A future improvement of the report might apply dimension-reduction methods, which are not covered here.
corrplot(cor(properties_numeric), type = "upper", order = "hclust", method = "number")
Firstly, it is important to measure the cluster tendency of the data. To check the clusterability of the data we use the Hopkins statistic. It acts as a statistical hypothesis test: the null hypothesis is that the data are uniformly distributed (without any significant clusters), and the alternative hypothesis is that the data are not uniformly distributed and thus contain significant clusters.
| \(h \approx 0\) | \(h \approx 0.5\) | \(h \approx 1\) |
|---|---|---|
| data uniformly distributed (no clusters) | data randomly distributed | data highly clustered |
Based on the value of the Hopkins statistic, we reject the null hypothesis that the data are uniformly distributed in favor of the alternative hypothesis that our data are not uniformly distributed and contain significant clusters.
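For reference, one common formulation of the statistic (used e.g. by the hopkins package, in the convention where values near 1 indicate clustered data) is \(H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}\), where \(d\) is the dimension of the data, \(u_i\) are distances from \(m\) points sampled uniformly over the data space to their nearest real observations, and \(w_i\) are nearest-neighbor distances among \(m\) randomly sampled real observations.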
clusterability <- get_clust_tendency(properties_numeric_z, 2, graph=TRUE,
gradient=list(low="red", mid="white", high="blue"), seed = 123)
# Checking the Hopkins statistic - it is close to 1, so the data are clearly clusterable
clusterability$hopkins_stat
## [1] 0.9570934
clusterability$plot
Referring to the methodology part, for the KMeans clustering method we have to provide the number of clusters in advance. Fortunately, there are methods for finding the optimal number of clusters. In this case we use the silhouette criterion and the percentage of variance explained by the clustering.
optimal_clusters<-Optimal_Clusters_KMeans(properties_numeric_z, max_clusters=10,
plot_clusters = TRUE, criterion = "variance_explained")
optimal_clusters1<-Optimal_Clusters_KMeans(properties_numeric_z, max_clusters=10,
plot_clusters=TRUE, criterion="silhouette")
cluster_KMeans <- eclust(properties_numeric_z, "kmeans", k = 4)
sill<-silhouette(cluster_KMeans$cluster, dist(properties_numeric_z))
fviz_silhouette(sill, print.summary = T)
## cluster size ave.sil.width
## 1 1 718 0.17
## 2 2 239 0.11
## 3 3 763 0.22
## 4 4 391 0.14
We obtain 4 clusters, with the highest average silhouette width for the third cluster (0.22) and the lowest for the second cluster (0.11). The overall average silhouette width is low (0.17), indicating that objects are weakly matched to their own clusters and some points may belong to neighboring clusters. To validate the obtained result we check the quality of the clustering using the Duda-Hart index.
# DUDA-HART INDEX
dh_index <- dudahart2(properties_numeric_z, cluster_KMeans$cluster)
dh_index
## $p.value
## [1] 0
##
## $dh
## [1] 0.3382467
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The output of dh_index: cluster1=FALSE. Referring to the description of the Duda-Hart index in the methodology, we reject \(H_0\) in favor of \(H_1\): the clusters are heterogeneous, and one can easily split them.
We could also choose 10 clusters. Taking silhouette width into consideration, we would go for 4 or 10 clusters; for simplicity we choose 4, though the choice usually also depends on the size of the data set. Formally, we can also check the Calinski-Harabasz index to compare alternative numbers of clusters.
cluster_KMeans10 <- eclust(properties_numeric_z, "kmeans", k = 10)
round(calinhara(properties_numeric_z, cluster_KMeans$cluster),digits=2)
## [1] 380.55
round(calinhara(properties_numeric_z, cluster_KMeans10$cluster),digits=2)
## [1] 314.82
Referring to the methodology part, the higher the index, the better. Compared with 10 clusters, the clustering with 4 centroids has data points more spread out between clusters than within clusters, so 4 clusters are preferred.
Summary of KMeans clustering
In reference to the analysis of clustering quality, the accuracy of the clustering is weak. The observations within each cluster are heterogeneous. The clusters are not representative, so further analysis based on the KMeans clustering will most likely have low accuracy.
Based on the variance explained, the silhouette width, and the simplicity of the result, we chose 3 clusters for the PAM algorithm.
pam<-eclust(properties_numeric_z, "pam", k=3, hc_metric="euclidean")
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 750 0.15
## 2 2 476 0.03
## 3 3 885 0.21
pam$medoids
## Price_m2 Surface Rooms If_Balcony Level Built_year
## [1,] 0.2104559 -0.6603025 -0.8499834 -1.1331476 -0.3047686 -0.3869399
## [2,] 0.6053150 0.1871874 0.4262008 0.8820795 0.0793308 -0.3456483
## [3,] -0.3914696 0.6304898 0.4262008 0.8820795 -0.3047686 0.3976002
## If_elevator If_parking if_gated_community
## [1,] -0.6124268 -0.3089336 -0.2654306
## [2,] 1.6320748 -0.3089336 -0.2654306
## [3,] -0.6124268 -0.3089336 -0.2654306
dh_pam_index <- dudahart2(properties_numeric_z, pam$cluster)
dh_pam_index
## $p.value
## [1] 0
##
## $dh
## [1] 0.5003245
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
As with KMeans clustering, the output of dh_pam_index is cluster1=FALSE. Referring to the description of the Duda-Hart index in the methodology, we reject \(H_0\) in favor of \(H_1\): the clusters are heterogeneous, and one can easily split them.
Summary of PAM clustering
In reference to the analysis of clustering quality, the accuracy of the clustering is weak. The observations within each cluster are heterogeneous. The clusters are not representative, so further analysis based on the PAM clustering will most likely have low accuracy.
To choose the linkage method for hierarchical clustering, we calculate agglomerative coefficients for the average, single, complete, and Ward methods.
dissimilarity_matrix <- dist(properties_numeric_z, method = "euclidean")
# Different methods which can be used to hierarchical clustering
hclust_methods <- c( "average", "single", "complete", "ward")
names(hclust_methods) <- c( "average", "single", "complete", "ward")
# Agglomerative coefficients for the methods
ac <- function(x) {
agnes(dissimilarity_matrix, method = x)$ac
}
map_dbl(hclust_methods, ac)
## average single complete ward
## 0.9064463 0.8748009 0.9455979 0.9922546
The agglomerative coefficient measures the amount of clustering structure found. Values closer to 1 suggest a strong clustering structure, so the higher the better. The Ward method obtained the highest value, so it will be used for the clustering.
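Per the cluster::agnes documentation, the coefficient is \(AC = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{m(i)}{M}\right)\), where \(m(i)\) is the dissimilarity at which observation \(i\) first merges with another cluster and \(M\) is the dissimilarity of the final merger.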
# Hierarchical clustering using ward method
hierarchical_clustering <- hclust(dissimilarity_matrix, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
# Dendrogram
plot(hierarchical_clustering, cex = 0.01, hang = -1, xlab="", sub="")
In hierarchical clustering there is no need to specify the number of clusters a priori. However, a number must be chosen when cutting the dendrogram. In this case the silhouette width and the elbow method are used.
# Elbow method
fviz_nbclust(properties_numeric_z, FUN = hcut, method = "wss")
# Silhouette width
fviz_nbclust(properties_numeric_z, FUN = hcut, method = "silhouette")
There is no significant drop in the elbow-method plot, so we cannot choose an optimal number of clusters from it explicitly. Luckily, the average silhouette width for two clusters is significantly higher than for other numbers. Therefore 2 clusters are chosen.
hc <- hclust(dissimilarity_matrix, method = "ward.D2")
# Cut tree into 2 groups
hierarchical_clusters <- cutree(hc, k = 2)
# Dendrogram with division on the clusters
plot(hc, cex = 0.1)
rect.hclust(hc, k = 2, border = 2:5)
# More visualization
fviz_cluster(list(data = properties_numeric_z, cluster = hierarchical_clusters))
# Silhouette width for the clustering
fviz_silhouette(silhouette(hierarchical_clusters, dissimilarity_matrix))
## cluster size ave.sil.width
## 1 1 1832 0.36
## 2 2 279 0.12
The average silhouette width for the hierarchical clustering is 0.33. Compared with the KMeans (0.17) and PAM (0.15) algorithms, hierarchical clustering does best.
The obtained clusters are not homogeneous (Duda-Hart index: \(H_0\) rejected in favor of \(H_1\)), but the dh value is significantly higher than for KMeans and PAM.
dh_hc_index <- dudahart2(properties_numeric_z, hierarchical_clusters)
dh_hc_index
## $p.value
## [1] 2.88658e-15
##
## $dh
## [1] 0.8528244
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The hierarchical clustering method obtained the highest silhouette value; therefore it was chosen for the basic analysis of the clusters.
properties_with_clusters <- properties %>%
  mutate(cluster = ifelse(hierarchical_clusters == 1, "cluster_1", "cluster_2"))
properties_with_clusters %>%
group_by(cluster) %>%
summarize(price_first_quartile = quantile(Price_m2, 0.25), price_median = quantile(Price_m2, 0.5),
price_third_quartile = quantile(Price_m2, 0.75))
## # A tibble: 2 × 4
## cluster price_first_quartile price_median price_third_quartile
## <chr> <dbl> <dbl> <dbl>
## 1 cluster_1 10518. 11878. 13295.
## 2 cluster_2 11041. 12800 13769.
# Bar plot of the number of rooms, split by cluster
ggplot(data = properties_with_clusters, aes(x = Rooms, fill = cluster)) +
geom_bar()
# Scatter plot of construction year vs price, colored by cluster
ggplot(data = properties_with_clusters, aes(x = Built_year, y = Price_m2, color = cluster)) +
geom_point()
As a last step, to predict the price of \(1m^2\) we build a separate linear regression model for each cluster. The independent variables are Surface, Rooms, If_Balcony, Level, If_elevator, If_parking, and if_gated_community.
# Data for cluster_1
cluster1 <- properties_with_clusters %>%
  filter(cluster == "cluster_1")
# Data for cluster_2
cluster2 <- properties_with_clusters %>%
  filter(cluster == "cluster_2")
table(cluster1$Location)
##
## Bemowo Bialolęka Bielany Mokotow Ochota
## 154 201 172 63 90
## Pragapolnoc Pragapoludnie Rembertow Srodmiescie Targowek
## 57 142 69 36 136
## Ursus Ursynow Wawer Wesola Wilanow
## 133 178 114 40 51
## Wlochy Wola Zoliborz
## 38 85 73
table(cluster2$Location)
##
## Bemowo Bialolęka Bielany Mokotow Ochota
## 43 24 3 10 7
## Pragapolnoc Pragapoludnie Rembertow Srodmiescie Targowek
## 8 29 13 2 4
## Ursus Ursynow Wawer Wesola Wilanow
## 16 15 40 18 17
## Wlochy Wola Zoliborz
## 17 12 1
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + If_parking + if_gated_community,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + If_parking + if_gated_community, data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6100.8 -1313.7 -64.1 1319.5 4160.5
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13757.285 166.598 82.578 < 2e-16 ***
## Surface -16.648 4.234 -3.932 8.74e-05 ***
## Rooms -385.078 81.376 -4.732 2.39e-06 ***
## If_Balcony -157.763 80.050 -1.971 0.048897 *
## Level 24.156 15.220 1.587 0.112659
## If_elevator 374.992 98.596 3.803 0.000147 ***
## If_parking NA NA NA NA
## if_gated_community NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1678 on 1826 degrees of freedom
## Multiple R-squared: 0.1061, Adjusted R-squared: 0.1036
## F-statistic: 43.33 on 5 and 1826 DF, p-value: < 2.2e-16
model_cluster_2 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + If_parking + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + If_parking + if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4460.7 -860.6 120.4 1055.3 4287.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12063.437 552.266 21.844 < 2e-16 ***
## Surface -4.851 9.605 -0.505 0.6139
## Rooms -226.169 204.181 -1.108 0.2690
## If_Balcony -202.721 193.922 -1.045 0.2968
## Level 97.561 46.865 2.082 0.0383 *
## If_elevator 1652.241 215.772 7.657 3.36e-13 ***
## If_parking 120.801 280.191 0.431 0.6667
## if_gated_community -210.144 266.234 -0.789 0.4306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 271 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2491
## F-statistic: 14.18 on 7 and 271 DF, p-value: 1.131e-15
Cluster_1 model
Let's first analyze the linear model for cluster_1. It was estimated on 1832 observations. The observations are homogeneous in terms of the variables If_parking and if_gated_community, so those coefficients could not be estimated. At the 5% significance level all variables apart from Level are statistically significant. In the next step the variable Level will be removed to avoid bias.
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + If_elevator,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + If_elevator,
## data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6076.5 -1299.8 -54.9 1321.0 4239.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13832.96 159.70 86.620 < 2e-16 ***
## Surface -17.52 4.20 -4.170 3.19e-05 ***
## Rooms -374.08 81.11 -4.612 4.27e-06 ***
## If_Balcony -143.22 79.56 -1.800 0.072 .
## If_elevator 403.74 96.96 4.164 3.27e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1678 on 1827 degrees of freedom
## Multiple R-squared: 0.1048, Adjusted R-squared: 0.1029
## F-statistic: 53.48 on 4 and 1827 DF, p-value: < 2.2e-16
Now the variable If_Balcony is not significant, so we drop it as well. The model for the first cluster (cluster_1) with all significant variables (Surface, Rooms, If_elevator) is presented below. The adjusted R-squared of 10.2% means that the model explains 10.2% of the total variability of the dependent variable Price_m2.
# Final model for cluster_1
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_elevator,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_elevator, data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6005.5 -1305.4 -71.9 1301.6 4331.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13783.380 157.401 87.569 < 2e-16 ***
## Surface -17.307 4.201 -4.120 3.96e-05 ***
## Rooms -389.283 80.723 -4.822 1.54e-06 ***
## If_elevator 400.438 97.000 4.128 3.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 1828 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1018
## F-statistic: 70.15 on 3 and 1828 DF, p-value: < 2.2e-16
Analysis of the model (cluster_1)
Now it is necessary to check whether the model was correctly specified. For this we perform the Ramsey RESET test. The p-value for the F-statistic is 16.1%; therefore, at the 5% significance level, we fail to reject the RESET null hypothesis of correct specification. This indicates that the functional form is correct and the model does not suffer from omitted variables.
model_cluster_1_RESET <- resettest(model_cluster_1, power = 2, type = "fitted", data = properties_numeric)
model_cluster_1_RESET
##
## RESET test
##
## data: model_cluster_1
## RESET = 1.9652, df1 = 1, df2 = 1827, p-value = 0.1611
We can assume that the residuals of the model are approximately normally distributed because of the size of the data set. In the last step we check the homoscedasticity of the residuals using the Breusch-Pagan test.
bptest(model_cluster_1)
##
## studentized Breusch-Pagan test
##
## data: model_cluster_1
## BP = 67.254, df = 3, p-value = 1.652e-14
The p-value of the Breusch-Pagan test is lower than the significance level (5%), so we reject the null hypothesis of homoscedastic residuals in favor of the alternative: the variance of the residuals is heterogeneous.
Cluster_2 model
We move on to the model for the second cluster (cluster_2). The model was estimated on 279 observations. In this model there are only 2 statistically significant variables, If_elevator and Level (5% significance level). To obtain a model in which all variables are statistically significant, we remove the insignificant variables one by one, starting with the one with the highest p-value.
model_cluster_2 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4462.8 -857.4 141.8 1059.3 4252.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12191.735 464.540 26.245 < 2e-16 ***
## Surface -4.589 9.571 -0.479 0.6320
## Rooms -234.779 202.898 -1.157 0.2482
## If_Balcony -202.756 193.632 -1.047 0.2960
## Level 98.237 46.769 2.100 0.0366 *
## If_elevator 1650.679 215.419 7.663 3.22e-13 ***
## if_gated_community -292.866 184.295 -1.589 0.1132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1525 on 272 degrees of freedom
## Multiple R-squared: 0.2675, Adjusted R-squared: 0.2514
## F-statistic: 16.56 on 6 and 272 DF, p-value: 2.881e-16
model_cluster_2 <- lm(Price_m2 ~ Rooms + If_Balcony + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Rooms + If_Balcony + Level + If_elevator +
## if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4516.2 -863.7 170.0 1050.8 4241.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12119.71 438.96 27.610 < 2e-16 ***
## Rooms -312.29 122.46 -2.550 0.0113 *
## If_Balcony -196.66 192.94 -1.019 0.3090
## Level 98.35 46.70 2.106 0.0361 *
## If_elevator 1650.84 215.11 7.674 2.96e-13 ***
## if_gated_community -293.18 184.03 -1.593 0.1123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1523 on 273 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2535
## F-statistic: 19.88 on 5 and 273 DF, p-value: < 2.2e-16
model_cluster_2 <- lm(Price_m2 ~ Rooms + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Rooms + Level + If_elevator + if_gated_community,
## data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4411.9 -896.9 121.8 1047.5 4160.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11999.85 422.95 28.372 < 2e-16 ***
## Rooms -307.10 122.36 -2.510 0.0127 *
## Level 91.76 46.26 1.984 0.0483 *
## If_elevator 1644.01 215.03 7.646 3.53e-13 ***
## if_gated_community -296.53 184.02 -1.611 0.1082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1523 on 274 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2534
## F-statistic: 24.59 on 4 and 274 DF, p-value: < 2.2e-16
# Final model for cluster_2
model_cluster_2 <- lm(Price_m2 ~ Level + Rooms + If_elevator,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Level + Rooms + If_elevator, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4292.4 -855.6 149.3 1121.5 3979.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11873.24 416.79 28.487 < 2e-16 ***
## Level 92.86 46.39 2.002 0.0463 *
## Rooms -304.72 122.71 -2.483 0.0136 *
## If_elevator 1601.13 213.99 7.482 9.89e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 275 degrees of freedom
## Multiple R-squared: 0.2571, Adjusted R-squared: 0.249
## F-statistic: 31.73 on 3 and 275 DF, p-value: < 2.2e-16
The final version of the model contains 3 statistically significant variables: Rooms, Level, and If_elevator (5% significance level). The adjusted R-squared of 24.9% means that the model explains 24.9% of the total variability of the dependent variable Price_m2. This is much higher than in the first model, but based on the literature the level is still low.
Analysis of the model (cluster_2)
The same analysis as for the first cluster is necessary to check whether the model for the second cluster was correctly specified. Again we perform the Ramsey RESET test. The p-value for the F-statistic is 47.5%; therefore, at the 5% significance level, we fail to reject the RESET null hypothesis of correct specification. This indicates that the functional form is correct and the model does not suffer from omitted variables.
model_cluster_2_RESET <- resettest(model_cluster_2, power = 2, type = "fitted", data = properties_numeric)
model_cluster_2_RESET
##
## RESET test
##
## data: model_cluster_2
## RESET = 0.51251, df1 = 1, df2 = 274, p-value = 0.4747
As in the model for cluster_1, we can assume that the residuals are approximately normally distributed because of the size of the data set. In the last step we check the homoscedasticity of the residuals using the Breusch-Pagan test.
bptest(model_cluster_2)
##
## studentized Breusch-Pagan test
##
## data: model_cluster_2
## BP = 13.179, df = 3, p-value = 0.004265
The p-value of the Breusch-Pagan test is lower than the significance level (5%), so we reject the null hypothesis of homoscedastic residuals in favor of the alternative: the variance of the residuals is heterogeneous.
Both linear models pass the Ramsey RESET test (the functional form of each model is correct) and, given the size of the data set, their residuals can be assumed approximately normally distributed. Both models fail the Breusch-Pagan test, which means the residuals are heteroscedastic. Summing up these checks, neither model satisfies the Gauss-Markov assumptions, so OLS is not guaranteed to have the lowest sampling variance within the class of linear unbiased estimators; for each cluster, other estimators may do better.
For both models the explanatory power is rather low. In future improvements it might be necessary to add variables crucial for property-price analysis or to add interactions of the variables.
model_cluster_1$coefficients
## (Intercept) Surface Rooms If_elevator
## 13783.38042 -17.30721 -389.28308 400.43818
model_cluster_2$coefficients
## (Intercept) Level Rooms If_elevator
## 11873.24242 92.85943 -304.71614 1601.12776
Based on the estimated coefficients we can infer that the price per square meter of a property without direct access to an elevator is much higher in the first cluster. The price gap is even larger for properties located on a low floor, with a small number of rooms and a small surface (e.g. a studio flat). On the other hand, the price of a property in the second cluster tends to be higher when the property is situated on a top floor and has direct access to an elevator.
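To make this concrete, below is a hedged sketch using predict() on two hypothetical apartments (the column values are illustrative assumptions, not observations from the data set):
# Hypothetical apartments for illustration only
studio_no_lift <- data.frame(Surface = 30, Rooms = 1, Level = 0, If_elevator = 0)
top_floor_lift <- data.frame(Surface = 30, Rooms = 1, Level = 10, If_elevator = 1)
predict(model_cluster_1, newdata = studio_no_lift)   # cluster_1: high price even without an elevator
predict(model_cluster_2, newdata = top_floor_lift)   # cluster_2: top floor and elevator raise the price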
Finally, we present visualizations of the actual values of Price_m2 against the values fitted by the models.
Model for cluster_1
model1.fitted <- fitted.values(model_cluster_1)
cluster1 %>% ggplot(aes(model1.fitted, Price_m2)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model 1")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Model for cluster_2
model2.fitted <- fitted.values(model_cluster_2)
cluster2 %>% ggplot(aes(model2.fitted, Price_m2)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model 2")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'