The aim of this paper is to divide the properties into clusters. Clustering methods yield groups of objects that are similar within a group and, as expected, dissimilar between groups. Clustering assists in market segmentation, dividing the overall property market into distinct groups. This can be valuable for targeted marketing strategies, as different clusters may have varying needs, preferences, and price sensitivities. Clustering can also be used for assessing risk: properties within the same cluster are likely to share similar risks and vulnerabilities. This information is valuable for investors, allowing them to diversify their portfolios and manage risk more effectively. Finally, clustering allows for a deeper understanding of the competitive landscape, helping property professionals make informed decisions about pricing, marketing, and overall strategy.
The medoid of a cluster is defined as the object in the cluster whose sum (and, equivalently, average) of dissimilarities to all the other objects in the cluster is minimal; that is, it is the most centrally located point in the cluster.
The centroid of a cluster is defined as the mean point of the cluster and represents its center. It need not be a member of the data set.
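Formally, for a cluster \(C\) with dissimilarity measure \(d\), the medoid is \(m = \arg\min_{x_i \in C} \sum_{x_j \in C} d(x_i, x_j)\), while the centroid is the arithmetic mean \(\mu = \frac{1}{|C|} \sum_{x_j \in C} x_j\).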
K-means clustering is a popular unsupervised machine learning algorithm. The aim of the clustering is to partition n observations into k clusters. To perform this algorithm we have to follow the steps below:
- Specify the number of centroids
- Assign each observation to the cluster with the nearest mean
- Recalculate the centroids from the observations assigned to each cluster
The last two steps are repeated; the algorithm has converged when the assignments no longer change. It is important to point out that the algorithm is not guaranteed to find the global optimum. K-means assigns observations to clusters by (squared) Euclidean distance; using a different distance function (e.g. Manhattan, Minkowski, Hamming, …) may prevent the algorithm from converging.
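As a minimal illustration of these steps in R (a toy sketch on the built-in mtcars data, not the properties data analyzed later):
set.seed(123)                                     # k-means initialization is random
toy <- as.data.frame(scale(mtcars))               # standardize so no variable dominates the distance
km_toy <- kmeans(toy, centers = 3, nstart = 25)   # nstart reruns mitigate local optima
km_toy$centers                                    # centroid coordinates (not necessarily data points)
table(km_toy$cluster)                             # cluster sizes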
PAM (Partitioning Around Medoids) is an unsupervised clustering algorithm similar to k-means. In contrast to the k-means algorithm, k-medoids chooses medoids as centers (in k-means we use centroids), which allows for better interpretation of the cluster centers than in k-means, where the center of a cluster is not necessarily one of the input data points. The main disadvantage of this algorithm is that the number of clusters must be known a priori. To perform this algorithm we have to follow the steps below:
- Specify the number of medoids
- Associate each observation with the closest medoid
- For each medoid, consider swapping it with a non-medoid point. If the swap decreases the within-cluster sum of squares, perform it.
- Reassign each point to the cluster defined by the closest medoid
determined in the previous step
- Repeat the two steps above until no swap of a medoid with a non-medoid point lowers the within-cluster sum of squares.
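A minimal sketch of PAM in R with the cluster package (again on a toy built-in data set):
set.seed(123)
toy <- as.data.frame(scale(mtcars))
pam_toy <- cluster::pam(toy, k = 3, metric = "euclidean")
pam_toy$medoids             # medoids are actual observations from the data
table(pam_toy$clustering)   # cluster sizes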
Hierarchical clustering
It is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:
- Agglomerative (bottom-up) - each observation starts as its own cluster, and pairs of the most similar clusters are merged until one cluster contains all observations from the initial data set.
- Divisive (top-down) - the reverse of agglomerative: it starts with one cluster containing all observations and ends with each observation as its own cluster.
Results of hierarchical clustering are usually presented in a dendrogram.
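A minimal sketch of agglomerative hierarchical clustering in R (divisive clustering is available e.g. via cluster::diana):
toy_dist <- dist(scale(mtcars))                   # Euclidean dissimilarity matrix
hc_toy <- hclust(toy_dist, method = "ward.D2")    # agglomerative clustering with Ward linkage
plot(hc_toy)                                      # dendrogram of the merge hierarchy
cutree(hc_toy, k = 2)                             # cut the dendrogram into 2 clusters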
Silhouette width
The silhouette value is a measure of how similar an object is to its own
cluster compared to other clusters. The silhouette ranges from −1 to +1,
where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighboring clusters. If most objects have
a high value, then the clustering configuration is appropriate. If many
points have a low or negative value, then the clustering configuration
may have too many or too few clusters. A clustering with an average
silhouette width of over 0.7 is considered to be “strong”, a value over
0.5 “reasonable”, and over 0.25 “weak”. A negative silhouette value indicates a poor assignment: the observation most likely belongs to another cluster. With increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.
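For completeness, the silhouette of observation \(i\) is defined as \(s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\), where \(a(i)\) is the mean dissimilarity of \(i\) to the other members of its own cluster and \(b(i)\) is the smallest mean dissimilarity of \(i\) to the members of any other cluster.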
The Duda-Hart index is a test of clustering quality. It is well defined for the k-means setting, and its hypotheses are as follows:
- \(H_0\): homogeneity of the cluster (the data within the cluster are similar)
- \(H_1\): heterogeneity of the cluster (the cluster can easily be split)
The test statistic \(dh\) is the ratio of the within-cluster sum of squares when the cluster is split into two clusters to the overall sum of squares of the undivided cluster.
Verification of the hypotheses: cluster1=FALSE means that \(H_0\) of homogeneity is rejected and \(H_1\) is accepted.
The Calinski-Harabasz index is a metric for evaluating clustering algorithms. It is constructed as a fraction with the between-group sum of squares in the numerator and the within-cluster sum of squares (the sum of the within-cluster dispersions over all clusters) in the denominator. From this construction we can see that the higher the statistic, the better: it means the between-group sum of squares is higher, the within-cluster sum of squares is lower, or both. The statistic is usually used for comparing solutions with alternative numbers of clusters rather than different algorithms.
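In symbols, for \(n\) observations grouped into \(k\) clusters, \(CH = \frac{BGSS/(k-1)}{WGSS/(n-k)}\), where \(BGSS\) is the between-group sum of squares and \(WGSS\) is the within-cluster sum of squares.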
Linear regression
Linear regression is used to predict the value of an outcome variable
\(Y\) based on one or more input
predictor variables \(X\). The aim is to establish a linear relationship between the dependent variable and the independent variables, so that we can use this relationship to estimate the value of the response \(Y\) when only the predictor values \(X\) are known. In the simplest model, with one variable, the formula is \(Y = β_1 + β_2X + ϵ\), where \(β_1\) is the intercept and \(β_2\) is the slope. Collectively, they are called regression coefficients. \(ϵ\) is the error term, the part of \(Y\) the regression model is unable to explain.
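With more than one predictor the same idea generalizes to \(Y = Xβ + ϵ\); the coefficients are estimated by ordinary least squares, which minimizes the sum of squared residuals and yields \(\hat{β} = (X^{T}X)^{-1}X^{T}y\).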
The data set used in this paper was sourced from the morizon.pl website using web scraping. This study focuses on modeling apartment prices in Warsaw that an average person could afford. Therefore, certain constraints were imposed on the data set while scraping. The acquired data satisfy the following conditions:
- The price per square meter of the apartment is not greater than 15,000
PLN/\(m^2\)
- The area of the apartment does not exceed 100 \(m^2\)
Number of observations: 2111
Number of variables: 11
Number of numeric variables: 9
List of attributes with explanation
| Variable name | Variable explanation |
|---|---|
| Location | location of the property (1 out of 18 districts) |
| Market | aftermarket/primary market |
| Price_m2 | price of \(m^2\) in zl |
| Surface | surface of the property in \(m^2\) |
| Rooms | number of rooms |
| If_Balcony | 1 means that the property has a balcony, 0 otherwise |
| Level | the floor on which the property is situated |
| Built_year | year of construction |
| If_elevator | 1 means that the property has access to an elevator, 0 otherwise |
| If_parking | 1 means that the property has access to an individual parking space, 0 otherwise |
| if_gated_community | 1 means that the property is situated in a gated community, 0 otherwise |
List of packages necessary to run the code
set.seed(123)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Warning: package 'flexclust' was built under R version 4.3.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
## Warning: package 'fpc' was built under R version 4.3.2
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(cluster)
## Warning: package 'cluster' was built under R version 4.3.2
library(ClusterR)
## Warning: package 'ClusterR' was built under R version 4.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dendextend)
## Warning: package 'dendextend' was built under R version 4.3.2
##
## Attaching package: 'dendextend'
##
## The following object is masked from 'package:stats':
##
## cutree
library(readxl)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
properties <- read_xlsx("C:/Users/User/Desktop/Studia/Praca licencjacka z ekonomii/Licencjat Ekonomia/Properties.xlsx")
# Check the first values of every variable
glimpse(properties)
## Rows: 2,111
## Columns: 11
## $ Location <chr> "Ursynow", "srodmiescie", "Bielany", "srodmiescie",…
## $ Market <chr> "aftermarket", "aftermarket", "aftermarket", "after…
## $ Price_m2 <dbl> 15000.00, 15000.00, 15000.00, 15000.00, 15000.00, 1…
## $ Surface <dbl> 33.00, 34.68, 46.00, 50.00, 51.60, 54.00, 59.00, 60…
## $ Rooms <chr> "1", "1", "2", "3", "2", "3", "3", "2", "3", "2", "…
## $ If_Balcony <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, …
## $ Level <dbl> 0, 3, 6, 1, 2, 0, 5, 6, 2, 0, 5, 1, 1, 2, 1, 0, 2, …
## $ Built_year <chr> "2020", "1959", "2008", "1952", "2022", "1960", "19…
## $ If_elevator <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ If_parking <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ if_gated_community <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
properties$Location<-str_to_title(properties$Location)
# Variables Rooms and Built_year should be numeric
str(properties)
## tibble [2,111 × 11] (S3: tbl_df/tbl/data.frame)
## $ Location : chr [1:2111] "Ursynow" "Srodmiescie" "Bielany" "Srodmiescie" ...
## $ Market : chr [1:2111] "aftermarket" "aftermarket" "aftermarket" "aftermarket" ...
## $ Price_m2 : num [1:2111] 15000 15000 15000 15000 15000 15000 15000 15000 15000 15000 ...
## $ Surface : num [1:2111] 33 34.7 46 50 51.6 ...
## $ Rooms : chr [1:2111] "1" "1" "2" "3" ...
## $ If_Balcony : num [1:2111] 0 0 1 0 0 0 0 0 0 0 ...
## $ Level : num [1:2111] 0 3 6 1 2 0 5 6 2 0 ...
## $ Built_year : chr [1:2111] "2020" "1959" "2008" "1952" ...
## $ If_elevator : num [1:2111] 0 0 0 0 0 0 1 1 0 0 ...
## $ If_parking : num [1:2111] 0 0 0 0 0 0 0 1 0 0 ...
## $ if_gated_community: num [1:2111] 0 0 0 0 0 0 0 0 0 0 ...
# Converting Variables Rooms and Built_year to numeric
properties$Rooms <- as.numeric(properties$Rooms)
properties$Built_year <- as.numeric(properties$Built_year)
# There are no missing values
properties[!complete.cases(properties), ]
## # A tibble: 0 × 11
## # ℹ 11 variables: Location <chr>, Market <chr>, Price_m2 <dbl>, Surface <dbl>,
## # Rooms <dbl>, If_Balcony <dbl>, Level <dbl>, Built_year <dbl>,
## # If_elevator <dbl>, If_parking <dbl>, if_gated_community <dbl>
aggregation <- aggregate(properties$Price_m2 ~ properties$Rooms, FUN = median)
colnames(aggregation) <- c("Number of rooms", "Median of 1m2 price")
aggregation
## Number of rooms Median of 1m2 price
## 1 1 12987.78
## 2 2 12601.94
## 3 3 11661.81
## 4 4 10846.34
## 5 5 10791.33
# Overview statistics of the data
summary(properties)
## Location Market Price_m2 Surface
## Length:2111 Length:2111 Min. : 5583 Min. :13.83
## Class :character Class :character 1st Qu.:10595 1st Qu.:46.00
## Mode :character Mode :character Median :11965 Median :56.00
## Mean :11924 Mean :57.13
## 3rd Qu.:13386 3rd Qu.:67.40
## Max. :15000 Max. :99.70
## Rooms If_Balcony Level Built_year
## Min. :1.000 Min. :0.0000 Min. :-1.000 Min. :1891
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:1975
## Median :3.000 Median :1.0000 Median : 2.000 Median :2000
## Mean :2.666 Mean :0.5623 Mean : 2.793 Mean :1994
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.: 4.000 3rd Qu.:2018
## Max. :5.000 Max. :1.0000 Max. :18.000 Max. :2025
## If_elevator If_parking if_gated_community
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.2729 Mean :0.08716 Mean :0.06585
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
# Aftermarket properties account for a much greater share of the data set
table(properties$Market)
##
## aftermarket primary_market
## 1840 271
table(properties$Location, properties$Market)
##
## aftermarket primary_market
## Bemowo 169 28
## Bialolęka 197 28
## Bielany 132 43
## Mokotow 72 1
## Ochota 97 0
## Pragapolnoc 64 1
## Pragapoludnie 163 8
## Rembertow 58 24
## Srodmiescie 38 0
## Targowek 118 22
## Ursus 133 16
## Ursynow 190 3
## Wawer 115 39
## Wesola 44 14
## Wilanow 57 11
## Wlochy 33 22
## Wola 86 11
## Zoliborz 74 0
# Number of properties in the data set for each location, split by market type
ggplot(data = properties, aes(y = Location, fill = Market)) +
geom_bar() +
ggtitle("Number of properties in the dataset for different locations in division to market type") +
xlab("Count") +
theme(plot.title = element_text(hjust = 0.5))
# Subsetting all numeric variables
properties_numeric <- properties[,3:11]
# Z-score standardization of the data
properties_numeric_z <- as.data.frame(lapply(properties_numeric, scale))
The next step is a correlation analysis. It is important to see whether the variables are correlated, because high correlation between variables might bias further analysis. From the plot we observe that only Surface and Rooms are highly linearly correlated (0.79). In our case we analyze the data set with all variables included, as this report extends other analyses of the same data. A future improvement of the report might apply dimension-reduction methods, which are not covered here.
corrplot(cor(properties_numeric), type = "upper", order = "hclust", method = "number")
Firstly, it is important to measure the cluster tendency of the data. To check the clusterability of the data we use the Hopkins statistic. It acts as a statistical hypothesis test: the null hypothesis is that the data are uniformly distributed (without any significant clusters), and the alternative hypothesis is that the data are not uniformly distributed and thus contain significant clusters.
| \(h \approx 0\) | \(h \approx 0.5\) | \(h \approx 1\) |
|---|---|---|
| data uniformly distributed (no clusters) | data randomly distributed | data highly clustered |
Based on the value of the Hopkins statistic, we reject the null hypothesis that the data are uniformly distributed in favor of the alternative hypothesis that our data are not uniformly distributed and contain significant clusters.
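For reference, one common formulation of the statistic (used e.g. by the hopkins package, in the convention where values near 1 indicate clustered data) is \(H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}\), where \(d\) is the dimension of the data, \(u_i\) are distances from \(m\) points sampled uniformly over the data space to their nearest real observations, and \(w_i\) are nearest-neighbor distances among \(m\) randomly sampled real observations.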
clusterability <- get_clust_tendency(properties_numeric_z, 2, graph=TRUE,
gradient=list(low="red", mid="white", high="blue"), seed = 123)
# Checking the Hopkins statistic - it is close to 1, so the data are clearly clusterable
clusterability$hopkins_stat
## [1] 0.9570934
clusterability$plot
Referring to the methodology part, for the KMeans clustering method we have to provide the number of clusters in advance. Fortunately, there are methods for finding the optimal number of clusters. In this case we use the silhouette criterion and the percentage of variance explained by the clustering.
optimal_clusters<-Optimal_Clusters_KMeans(properties_numeric_z, max_clusters=10,
plot_clusters = TRUE, criterion = "variance_explained")
optimal_clusters1<-Optimal_Clusters_KMeans(properties_numeric_z, max_clusters=10,
plot_clusters=TRUE, criterion="silhouette")
cluster_KMeans <- eclust(properties_numeric_z, "kmeans", k = 4)
sill<-silhouette(cluster_KMeans$cluster, dist(properties_numeric_z))
fviz_silhouette(sill, print.summary = T)
## cluster size ave.sil.width
## 1 1 718 0.17
## 2 2 239 0.11
## 3 3 763 0.22
## 4 4 391 0.14
We obtain 4 clusters, with the highest average silhouette width for the third cluster (0.22) and the lowest for the second cluster (0.11). The overall average silhouette width is low (0.17), indicating that objects are weakly matched to their own clusters and some points may belong to neighboring clusters. To validate the obtained result we check the quality of the clustering using the Duda-Hart index.
# DUDA-HART INDEX
dh_index <- dudahart2(properties_numeric_z, cluster_KMeans$cluster)
dh_index
## $p.value
## [1] 0
##
## $dh
## [1] 0.3382467
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The output of dh_index: cluster1=FALSE. Referring to the description of the Duda-Hart index in the methodology, we reject \(H_0\) in favor of \(H_1\): the clusters are heterogeneous, and one can easily split them.
We could also choose 10 clusters. Taking silhouette width into consideration, we would go for 4 or 10 clusters; for simplicity we choose 4, though the choice usually also depends on the size of the data set. Formally, we can also check the Calinski-Harabasz index to compare alternative numbers of clusters.
cluster_KMeans10 <- eclust(properties_numeric_z, "kmeans", k = 10)
round(calinhara(properties_numeric_z, cluster_KMeans$cluster),digits=2)
## [1] 380.55
round(calinhara(properties_numeric_z, cluster_KMeans10$cluster),digits=2)
## [1] 314.82
Referring to the methodology part, the higher the index, the better. Compared with 10 clusters, the clustering with 4 centroids has data points more spread out between clusters than within clusters, so 4 clusters are preferred.
Summary of KMeans clustering
In reference to the analysis of clustering quality, the accuracy of the clustering is weak. The observations within each cluster are heterogeneous. The clusters are not representative, so further analysis based on the KMeans clustering will most likely have low accuracy.
Based on the variance explained, the silhouette width, and the simplicity of the result, we chose 3 clusters for the PAM algorithm.
pam<-eclust(properties_numeric_z, "pam", k=3, hc_metric="euclidean")
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 750 0.15
## 2 2 476 0.03
## 3 3 885 0.21
pam$medoids
## Price_m2 Surface Rooms If_Balcony Level Built_year
## [1,] 0.2104559 -0.6603025 -0.8499834 -1.1331476 -0.3047686 -0.3869399
## [2,] 0.6053150 0.1871874 0.4262008 0.8820795 0.0793308 -0.3456483
## [3,] -0.3914696 0.6304898 0.4262008 0.8820795 -0.3047686 0.3976002
## If_elevator If_parking if_gated_community
## [1,] -0.6124268 -0.3089336 -0.2654306
## [2,] 1.6320748 -0.3089336 -0.2654306
## [3,] -0.6124268 -0.3089336 -0.2654306
dh_pam_index <- dudahart2(properties_numeric_z, pam$cluster)
dh_pam_index
## $p.value
## [1] 0
##
## $dh
## [1] 0.5003245
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
As with KMeans clustering, the output of dh_pam_index is cluster1=FALSE. Referring to the description of the Duda-Hart index in the methodology, we reject \(H_0\) in favor of \(H_1\): the clusters are heterogeneous, and one can easily split them.
Summary of PAM clustering
In reference to the analysis of clustering quality, the accuracy of the clustering is weak. The observations within each cluster are heterogeneous. The clusters are not representative, so further analysis based on the PAM clustering will most likely have low accuracy.
To choose the linkage method for hierarchical clustering, we calculate agglomerative coefficients for the average, single, complete, and Ward methods.
dissimilarity_matrix <- dist(properties_numeric_z, method = "euclidean")
# Different methods which can be used to hierarchical clustering
hclust_methods <- c( "average", "single", "complete", "ward")
names(hclust_methods) <- c( "average", "single", "complete", "ward")
# Agglomerative coefficients for the methods
ac <- function(x) {
agnes(dissimilarity_matrix, method = x)$ac
}
map_dbl(hclust_methods, ac)
## average single complete ward
## 0.9064463 0.8748009 0.9455979 0.9922546
The agglomerative coefficient measures the amount of clustering structure found. Values closer to 1 suggest a strong clustering structure, so the higher the better. The Ward method obtained the highest value, so it will be used for the clustering.
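Per the cluster::agnes documentation, the coefficient is \(AC = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{m(i)}{M}\right)\), where \(m(i)\) is the dissimilarity at which observation \(i\) first merges with another cluster and \(M\) is the dissimilarity of the final merger.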
# Hierarchical clustering using ward method
hierarchical_clustering <- hclust(dissimilarity_matrix, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
# Dendrogram
plot(hierarchical_clustering, cex = 0.01, hang = -1, xlab="", sub="")
In hierarchical clustering there is no need to specify the number of clusters a priori. However, a number must be chosen when cutting the dendrogram. In this case the silhouette width and the elbow method are used.
# Elbow method
fviz_nbclust(properties_numeric_z, FUN = hcut, method = "wss")
# Silhouette width
fviz_nbclust(properties_numeric_z, FUN = hcut, method = "silhouette")
There is no significant drop in the elbow-method plot, so we cannot choose an optimal number of clusters from it explicitly. Luckily, the average silhouette width for two clusters is significantly higher than for other numbers. Therefore 2 clusters are chosen.
hc <- hclust(dissimilarity_matrix, method = "ward.D2")
# Cut tree into 2 groups
hierarchical_clusters <- cutree(hc, k = 2)
# Dendrogram with division on the clusters
plot(hc, cex = 0.1)
rect.hclust(hc, k = 2, border = 2:5)
# More visualization
fviz_cluster(list(data = properties_numeric_z, cluster = hierarchical_clusters))
# Silhouette width for the clustering
fviz_silhouette(silhouette(hierarchical_clusters, dissimilarity_matrix))
## cluster size ave.sil.width
## 1 1 1832 0.36
## 2 2 279 0.12
The average silhouette width for the hierarchical clustering is 0.33. Compared with the KMeans (0.17) and PAM (0.15) algorithms, hierarchical clustering does best.
The obtained clusters are not homogeneous (Duda-Hart index: \(H_0\) rejected in favor of \(H_1\)), but the dh value is significantly higher than for KMeans and PAM.
dh_hc_index <- dudahart2(properties_numeric_z, hierarchical_clusters)
dh_hc_index
## $p.value
## [1] 2.88658e-15
##
## $dh
## [1] 0.8528244
##
## $compare
## [1] 0.89902
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The hierarchical clustering method obtained the highest silhouette value; therefore it was chosen for the basic analysis of the clusters.
properties_with_clusters <- properties %>%
  mutate(cluster = ifelse(hierarchical_clusters == 1, "cluster_1", "cluster_2"))
properties_with_clusters %>%
group_by(cluster) %>%
summarize(price_first_quartile = quantile(Price_m2, 0.25), price_median = quantile(Price_m2, 0.5),
price_third_quartile = quantile(Price_m2, 0.75))
## # A tibble: 2 × 4
## cluster price_first_quartile price_median price_third_quartile
## <chr> <dbl> <dbl> <dbl>
## 1 cluster_1 10518. 11878. 13295.
## 2 cluster_2 11041. 12800 13769.
# Bar plot of the number of rooms, split by cluster
ggplot(data = properties_with_clusters, aes(x = Rooms, fill = cluster)) +
geom_bar()
# Scatter plot of construction year vs price, colored by cluster
ggplot(data = properties_with_clusters, aes(x = Built_year, y = Price_m2, color = cluster)) +
geom_point()
As a last step, to predict the price of \(1m^2\) we build a separate linear regression model for each cluster. The independent variables are Surface, Rooms, If_Balcony, Level, If_elevator, If_parking, and if_gated_community.
# Data for cluster_1
cluster1 <- properties_with_clusters %>%
  filter(cluster == "cluster_1")
# Data for cluster_2
cluster2 <- properties_with_clusters %>%
  filter(cluster == "cluster_2")
table(cluster1$Location)
##
## Bemowo Bialolęka Bielany Mokotow Ochota
## 154 201 172 63 90
## Pragapolnoc Pragapoludnie Rembertow Srodmiescie Targowek
## 57 142 69 36 136
## Ursus Ursynow Wawer Wesola Wilanow
## 133 178 114 40 51
## Wlochy Wola Zoliborz
## 38 85 73
table(cluster2$Location)
##
## Bemowo Bialolęka Bielany Mokotow Ochota
## 43 24 3 10 7
## Pragapolnoc Pragapoludnie Rembertow Srodmiescie Targowek
## 8 29 13 2 4
## Ursus Ursynow Wawer Wesola Wilanow
## 16 15 40 18 17
## Wlochy Wola Zoliborz
## 17 12 1
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + If_parking + if_gated_community,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + If_parking + if_gated_community, data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6100.8 -1313.7 -64.1 1319.5 4160.5
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13757.285 166.598 82.578 < 2e-16 ***
## Surface -16.648 4.234 -3.932 8.74e-05 ***
## Rooms -385.078 81.376 -4.732 2.39e-06 ***
## If_Balcony -157.763 80.050 -1.971 0.048897 *
## Level 24.156 15.220 1.587 0.112659
## If_elevator 374.992 98.596 3.803 0.000147 ***
## If_parking NA NA NA NA
## if_gated_community NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1678 on 1826 degrees of freedom
## Multiple R-squared: 0.1061, Adjusted R-squared: 0.1036
## F-statistic: 43.33 on 5 and 1826 DF, p-value: < 2.2e-16
model_cluster_2 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + If_parking + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + If_parking + if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4460.7 -860.6 120.4 1055.3 4287.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12063.437 552.266 21.844 < 2e-16 ***
## Surface -4.851 9.605 -0.505 0.6139
## Rooms -226.169 204.181 -1.108 0.2690
## If_Balcony -202.721 193.922 -1.045 0.2968
## Level 97.561 46.865 2.082 0.0383 *
## If_elevator 1652.241 215.772 7.657 3.36e-13 ***
## If_parking 120.801 280.191 0.431 0.6667
## if_gated_community -210.144 266.234 -0.789 0.4306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 271 degrees of freedom
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2491
## F-statistic: 14.18 on 7 and 271 DF, p-value: 1.131e-15
Cluster_1 model
Let's first analyze the linear model for cluster_1. It was estimated on 1832 observations. The observations are homogeneous in terms of the variables If_parking and if_gated_community, so those coefficients could not be estimated. At the 5% significance level all variables apart from Level are statistically significant. In the next step the variable Level will be removed to avoid bias.
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + If_elevator,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + If_elevator,
## data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6076.5 -1299.8 -54.9 1321.0 4239.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13832.96 159.70 86.620 < 2e-16 ***
## Surface -17.52 4.20 -4.170 3.19e-05 ***
## Rooms -374.08 81.11 -4.612 4.27e-06 ***
## If_Balcony -143.22 79.56 -1.800 0.072 .
## If_elevator 403.74 96.96 4.164 3.27e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1678 on 1827 degrees of freedom
## Multiple R-squared: 0.1048, Adjusted R-squared: 0.1029
## F-statistic: 53.48 on 4 and 1827 DF, p-value: < 2.2e-16
Now the variable If_Balcony is not significant, so we drop it as well. The model for the first cluster (cluster_1) with all significant variables (Surface, Rooms, If_elevator) is presented below. The adjusted R-squared of 10.2% means that the model explains 10.2% of the total variability of the dependent variable Price_m2.
# Final model for cluster_1
model_cluster_1 <- lm(Price_m2 ~ Surface + Rooms + If_elevator,
data = cluster1)
summary(model_cluster_1)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_elevator, data = cluster1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6005.5 -1305.4 -71.9 1301.6 4331.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13783.380 157.401 87.569 < 2e-16 ***
## Surface -17.307 4.201 -4.120 3.96e-05 ***
## Rooms -389.283 80.723 -4.822 1.54e-06 ***
## If_elevator 400.438 97.000 4.128 3.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 1828 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1018
## F-statistic: 70.15 on 3 and 1828 DF, p-value: < 2.2e-16
Analysis of the model (cluster_1)
Now it is necessary to check whether the model was correctly specified. For this we perform the Ramsey RESET test. The p-value for the F-statistic is 16.1%; therefore, at the 5% significance level, we fail to reject the RESET null hypothesis of correct specification. This indicates that the functional form is correct and the model does not suffer from omitted variables.
model_cluster_1_RESET <- resettest(model_cluster_1, power = 2, type = "fitted", data = properties_numeric)
model_cluster_1_RESET
##
## RESET test
##
## data: model_cluster_1
## RESET = 1.9652, df1 = 1, df2 = 1827, p-value = 0.1611
We can assume that the residuals of the model are approximately normally distributed because of the size of the data set. In the last step we check the homoscedasticity of the residuals using the Breusch-Pagan test.
bptest(model_cluster_1)
##
## studentized Breusch-Pagan test
##
## data: model_cluster_1
## BP = 67.254, df = 3, p-value = 1.652e-14
The p-value of the Breusch-Pagan test is lower than the significance level (5%), so we reject the null hypothesis of homoscedastic residuals in favor of the alternative: the variance of the residuals is heterogeneous.
Cluster_2 model
We move on to the model for the second cluster (cluster_2). The model was estimated on 279 observations. In this model there are only 2 statistically significant variables, If_elevator and Level (5% significance level). To obtain a model in which all variables are statistically significant, we remove the insignificant variables one by one, starting with the one with the highest p-value.
model_cluster_2 <- lm(Price_m2 ~ Surface + Rooms + If_Balcony + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Surface + Rooms + If_Balcony + Level +
## If_elevator + if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4462.8 -857.4 141.8 1059.3 4252.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12191.735 464.540 26.245 < 2e-16 ***
## Surface -4.589 9.571 -0.479 0.6320
## Rooms -234.779 202.898 -1.157 0.2482
## If_Balcony -202.756 193.632 -1.047 0.2960
## Level 98.237 46.769 2.100 0.0366 *
## If_elevator 1650.679 215.419 7.663 3.22e-13 ***
## if_gated_community -292.866 184.295 -1.589 0.1132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1525 on 272 degrees of freedom
## Multiple R-squared: 0.2675, Adjusted R-squared: 0.2514
## F-statistic: 16.56 on 6 and 272 DF, p-value: 2.881e-16
model_cluster_2 <- lm(Price_m2 ~ Rooms + If_Balcony + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Rooms + If_Balcony + Level + If_elevator +
## if_gated_community, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4516.2 -863.7 170.0 1050.8 4241.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12119.71 438.96 27.610 < 2e-16 ***
## Rooms -312.29 122.46 -2.550 0.0113 *
## If_Balcony -196.66 192.94 -1.019 0.3090
## Level 98.35 46.70 2.106 0.0361 *
## If_elevator 1650.84 215.11 7.674 2.96e-13 ***
## if_gated_community -293.18 184.03 -1.593 0.1123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1523 on 273 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2535
## F-statistic: 19.88 on 5 and 273 DF, p-value: < 2.2e-16
model_cluster_2 <- lm(Price_m2 ~ Rooms + Level + If_elevator + if_gated_community,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Rooms + Level + If_elevator + if_gated_community,
## data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4411.9 -896.9 121.8 1047.5 4160.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11999.85 422.95 28.372 < 2e-16 ***
## Rooms -307.10 122.36 -2.510 0.0127 *
## Level 91.76 46.26 1.984 0.0483 *
## If_elevator 1644.01 215.03 7.646 3.53e-13 ***
## if_gated_community -296.53 184.02 -1.611 0.1082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1523 on 274 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2534
## F-statistic: 24.59 on 4 and 274 DF, p-value: < 2.2e-16
# Final model for cluster_2
model_cluster_2 <- lm(Price_m2 ~ Level + Rooms + If_elevator,
data = cluster2)
summary(model_cluster_2)
##
## Call:
## lm(formula = Price_m2 ~ Level + Rooms + If_elevator, data = cluster2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4292.4 -855.6 149.3 1121.5 3979.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11873.24 416.79 28.487 < 2e-16 ***
## Level 92.86 46.39 2.002 0.0463 *
## Rooms -304.72 122.71 -2.483 0.0136 *
## If_elevator 1601.13 213.99 7.482 9.89e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 275 degrees of freedom
## Multiple R-squared: 0.2571, Adjusted R-squared: 0.249
## F-statistic: 31.73 on 3 and 275 DF, p-value: < 2.2e-16
The final version of the model contains 3 statistically significant variables: Rooms, Level, and If_elevator (5% significance level). The adjusted R-squared of 24.9% means that the model explains 24.9% of the total variability of the dependent variable Price_m2. This is much higher than in the first model, but based on the literature the level is still low.
Analysis of the model (cluster_2)
The same analysis as for the first cluster is necessary to check whether the model for the second cluster was correctly specified. Again we perform the Ramsey RESET test. The p-value for the F-statistic is 47.5%; therefore, at the 5% significance level, we fail to reject the RESET null hypothesis of correct specification. This indicates that the functional form is correct and the model does not suffer from omitted variables.
model_cluster_2_RESET <- resettest(model_cluster_2, power = 2, type = "fitted", data = properties_numeric)
model_cluster_2_RESET
##
## RESET test
##
## data: model_cluster_2
## RESET = 0.51251, df1 = 1, df2 = 274, p-value = 0.4747
As in the model for cluster_1, we can assume that the residuals are approximately normally distributed because of the size of the data set. In the last step we check the homoscedasticity of the residuals using the Breusch-Pagan test.
bptest(model_cluster_2)
##
## studentized Breusch-Pagan test
##
## data: model_cluster_2
## BP = 13.179, df = 3, p-value = 0.004265
The p-value of the Breusch-Pagan test is lower than the significance level (5%), so we reject the null hypothesis of homoscedastic residuals in favor of the alternative: the variance of the residuals is heterogeneous.
Both linear models pass the Ramsey RESET test (the functional form of each model is correct) and, given the size of the data set, their residuals can be assumed approximately normally distributed. Both models fail the Breusch-Pagan test, which means the residuals are heteroscedastic. Summing up these checks, neither model satisfies the Gauss-Markov assumptions, so OLS is not guaranteed to have the lowest sampling variance within the class of linear unbiased estimators; for each cluster, other estimators may do better.
For both models the explanatory power is rather low. In future improvements it might be necessary to add variables crucial for property-price analysis or to add interactions of the variables.
model_cluster_1$coefficients
## (Intercept) Surface Rooms If_elevator
## 13783.38042 -17.30721 -389.28308 400.43818
model_cluster_2$coefficients
## (Intercept) Level Rooms If_elevator
## 11873.24242 92.85943 -304.71614 1601.12776
Based on the estimated coefficients we can infer that the price per square meter of a property without direct access to an elevator is much higher in the first cluster. The price gap is even larger for properties located on a low floor, with a small number of rooms and a small surface (e.g. a studio flat). On the other hand, the price of a property in the second cluster tends to be higher when the property is situated on a top floor and has direct access to an elevator.
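To make this concrete, below is a hedged sketch using predict() on two hypothetical apartments (the column values are illustrative assumptions, not observations from the data set):
# Hypothetical apartments for illustration only
studio_no_lift <- data.frame(Surface = 30, Rooms = 1, Level = 0, If_elevator = 0)
top_floor_lift <- data.frame(Surface = 30, Rooms = 1, Level = 10, If_elevator = 1)
predict(model_cluster_1, newdata = studio_no_lift)   # cluster_1: high price even without an elevator
predict(model_cluster_2, newdata = top_floor_lift)   # cluster_2: top floor and elevator raise the price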
Finally, we present visualizations of the actual values of Price_m2 against the values fitted by the models.
Model for cluster_1
model1.fitted <- fitted.values(model_cluster_1)
cluster1 %>% ggplot(aes(model1.fitted, Price_m2)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model 1")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Model for cluster_2
model2.fitted <- fitted.values(model_cluster_2)
cluster2 %>% ggplot(aes(model2.fitted, Price_m2)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model 2")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'