The real estate market plays a crucial role in economic growth, urban development, and investment strategies. Understanding its structure, pricing trends, and key influencing factors is essential for stakeholders, including buyers, investors, and policymakers. This project provides an in-depth analysis of the real estate market using statistical methods and data visualization techniques to extract meaningful insights from the available data.
The following milestones were outlined during design process:
Explatory Analysis: Understanding the format, types, and sources of data.
Data Cleansing and Wrangling: Ensuring data accuracy, handling missing values, and standardizing formats.
Outliers: Identifying and analyzing unusual price points that may affect overall trends.
Visualizations: Graphical representations to facilitate pattern recognition and trend analysis.
Descriptive Analysis: Exploring key metrics like mean, median, variance, and skewness, recovered aggregated statistical insights that describe the market dynamics.
Price Distribution Shape: Analyzing the distribution of prices through density plots and statistical summaries.
Statistical Tests: Assessing the statistical significance of median price differences across cities and investigating the impact of amenities on price variations.
Provide a clear understanding of real estate price structures.
Identify significant price differentiators and market trends.
Utilize statistical techniques to validate insights.
Present findings through comprehensive visual and tabular representations.
## Downloading necessary libraries
# Installing the libraries
if (!require(tidyverse)) install.packages("tidyverse")
if (!require(dplyr)) install.packages("dplyr")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(naniar)) install.packages("naniar")
if (!require(mice)) install.packages("mice")
if (!require(modelsummary)) install.packages("modelsummary")
if (!require(reshape2)) install.packages("reshape2")
if (!require(editrules)) install.packages("editrules")
if (!require(dlookr)) install.packages("dlookr")
if (!require(editrules)) install.packages("editrules")
if (!require(VIM)) install.packages("VIM")
if (!require(deducorrect)) install.packages("deducorrect")
if (!require(ISLR)) install.packages("ISLR")
if (!require(outliers)) install.packages("outliers")
if (!require(ggstatsplot)) install.packages("ggstatsplot")
if (!require(scales)) install.packages("scales")
if (!require(corrplot)) install.packages("corrplot")
if (!require(gridExtra)) install.packages("grid.extra")
if (!require(kableExtra)) install.packages("kable.extra")
if (!require(plyr)) install.packages("plyr")
if (!require(summarytools)) install.packages("summarytools")
if (!require(classInt)) install.packages("classInt")
if (!require(psych)) install.packages("psych")
if (!require(here)) install.packages("here")
if (!require(tidyr)) install.packages("tidyr")
if (!require(plotly)) install.packages("plotly")
if (!require(egg)) install.packages("egg")
# Attaching the libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(naniar)
library(mice)
library(modelsummary)
library(reshape2)
library(editrules)
library(dlookr)
library(editrules)
library(VIM)
library(deducorrect)
library(ISLR)
library(outliers)
library(ggstatsplot)
library(RColorBrewer)
library(viridis)
library(rlang)
library(scales)
library(corrplot)
library(gridExtra)
library(kableExtra)
library(plyr)
library(summarytools)
library(classInt)
library(psych)
library(here)
library(tidyr)
library(plotly)
library(egg)
The dataset contains apartment sales and rental offers from 15 largest cities in Poland (Warsaw, Lodz, Krakow, Wroclaw, Poznan, Gdansk, Szczecin, Bydgoszcz, Lublin, Katowice, Bialystok, Czestochowa). The data comes from local websites with apartments for sale. In order to fully capture the neighborhood of each apartment, each listing has been augmented with Open Street Map data with distances to points of interest (POIs). The data is collected monthly and covers the period from August 2023 to June 2024.
apartments_pl_YYY_MM.csv - monthly snapshot of sales listings
apartments_rent_pl_YYYY_MM.csv - monthly snapshot of rental offers
Important data manipulations conducted in this step are adding months of when the data come from to the dataset, and binding files. In the raw state of data, this date is only included in the name of the file, and also we have separate files for each month.
# Loading data
load_data <- function(name_struct) {
files_names =
# Listing files in the proper location and with given pattern
list.files(
path = file.path("./analiza_danych_projekt_zespolowy/Nieruchomosci w Polsce"),
pattern = name_struct,
full.names = TRUE)
myfiles = lapply(files_names, read.csv)
# adding column "month", labeling data from which month it was scrapped
for (i in seq_along(myfiles)) {myfiles[[i]]$month = to_date(files_names[i])}
# bind together dataframe
return(do.call(rbind, myfiles))
}
# adding new column to each DF with month
to_date <- function(primary_date) {
date_part <- sub(".*(\\d{4}_\\d{2}).*", "\\1", primary_date)
complete_date <- paste0(date_part, "_01")
parsed_date <-
as.Date(complete_date, format = "%Y_%m_%d")
return(parsed_date)
}
data_buy = load_data("^apartments_pl_[0-9]+_[0-9]+\\.csv$")
data_rent = load_data("^apartments_rent_pl_[0-9]+_[0-9]+\\.csv$")
Data cleansing and wrangling are essential steps in the data analysis process, ensuring that raw data is transformed into a structured, accurate, and usable format. Data cleansing involves among others handling missing values, outliers and identifying and correcting errors. Data wrangling focuses on reshaping, merging, and transforming datasets to make them suitable for analysis.
The goal of these processes is to enhance data integrity, eliminate inconsistencies, and prepare the dataset for meaningful insights.
Row IDs are unique and therefore provide no differentiation or analytical value. Additionally, due to the nature of the data, duplicate values are possible. For example, if a property was listed for sale in January but sold in March, it would appear three times—once for each month it was listed.
n_occur_buy <- data.frame(table(data_buy$id))
# n_occur_buy[n_occur_buy$Freq > 1,]
n_occur_rent <- data.frame(table(data_rent$id))
# n_occur_rent[n_occur_rent$Freq > 1,]
data_buy <- data_buy[!duplicated(data_buy$id),]
data_rent <- data_rent[!duplicated(data_rent$id),]
data_buy <- data_buy %>%
select(-id)
data_rent <- data_rent %>%
select(-id)
datasummary_skim(data_buy)
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| squareMeters | 7173 | 0 | 56.5 | 20.3 | 25.0 | 52.6 | 150.0 | |
| rooms | 6 | 0 | 2.6 | 0.9 | 1.0 | 2.0 | 6.0 | |
| floor | 28 | 17 | 3.4 | 2.6 | 1.0 | 3.0 | 29.0 | |
| floorCount | 30 | 1 | 5.5 | 3.4 | 1.0 | 4.0 | 29.0 | |
| buildYear | 166 | 17 | 1987.1 | 31.9 | 1850.0 | 1994.0 | 2024.0 | |
| latitude | 46449 | 0 | 52.0 | 1.3 | 50.0 | 52.2 | 54.6 | |
| longitude | 48233 | 0 | 19.6 | 1.7 | 14.4 | 19.9 | 23.2 | |
| centreDistance | 1487 | 0 | 4.6 | 2.8 | 0.0 | 4.3 | 16.9 | |
| poiCount | 196 | 0 | 19.9 | 22.8 | 0.0 | 14.0 | 212.0 | |
| schoolDistance | 2506 | 0 | 0.4 | 0.5 | 0.0 | 0.3 | 4.9 | |
| clinicDistance | 4264 | 0 | 1.0 | 0.9 | 0.0 | 0.7 | 5.0 | |
| postOfficeDistance | 2690 | 0 | 0.5 | 0.5 | 0.0 | 0.4 | 5.0 | |
| kindergartenDistance | 2310 | 0 | 0.4 | 0.4 | 0.0 | 0.3 | 5.0 | |
| restaurantDistance | 2407 | 0 | 0.3 | 0.4 | 0.0 | 0.2 | 5.0 | |
| collegeDistance | 4818 | 3 | 1.5 | 1.1 | 0.0 | 1.2 | 5.0 | |
| pharmacyDistance | 2421 | 0 | 0.4 | 0.4 | 0.0 | 0.2 | 5.0 | |
| price | 7287 | 0 | 770307.8 | 397438.1 | 150000.0 | 689000.0 | 3250000.0 | |
| N | % | |||||||
| city | bialystok | 1017 | 1.1 | |||||
| bydgoszcz | 3627 | 3.9 | ||||||
| czestochowa | 1242 | 1.3 | ||||||
| gdansk | 8244 | 8.9 | ||||||
| gdynia | 3569 | 3.8 | ||||||
| katowice | 2397 | 2.6 | ||||||
| krakow | 14186 | 15.3 | ||||||
| lodz | 7073 | 7.6 | ||||||
| lublin | 2299 | 2.5 | ||||||
| poznan | 3639 | 3.9 | ||||||
| radom | 845 | 0.9 | ||||||
| rzeszow | 762 | 0.8 | ||||||
| szczecin | 2487 | 2.7 | ||||||
| warszawa | 31982 | 34.4 | ||||||
| wroclaw | 9598 | 10.3 | ||||||
| type | 19792 | 21.3 | ||||||
| apartmentBuilding | 14947 | 16.1 | ||||||
| blockOfFlats | 46347 | 49.9 | ||||||
| tenement | 11881 | 12.8 | ||||||
| ownership | condominium | 83368 | 89.7 | |||||
| cooperative | 9592 | 10.3 | ||||||
| udział | 7 | 0.0 | ||||||
| buildingMaterial | 39179 | 42.1 | ||||||
| brick | 40522 | 43.6 | ||||||
| concreteSlab | 13266 | 14.3 | ||||||
| condition | 69914 | 75.2 | ||||||
| low | 10373 | 11.2 | ||||||
| premium | 12680 | 13.6 | ||||||
| hasParkingSpace | no | 67949 | 73.1 | |||||
| yes | 25018 | 26.9 | ||||||
| hasBalcony | no | 38958 | 41.9 | |||||
| yes | 54009 | 58.1 | ||||||
| hasElevator | 4448 | 4.8 | ||||||
| no | 43155 | 46.4 | ||||||
| yes | 45364 | 48.8 | ||||||
| hasSecurity | no | 82428 | 88.7 | |||||
| yes | 10539 | 11.3 | ||||||
| hasStorageRoom | no | 51899 | 55.8 | |||||
| yes | 41068 | 44.2 |
datasummary_skim(data_rent)
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| squareMeters | 3260 | 0 | 53.4 | 21.2 | 25.0 | 48.5 | 150.0 | |
| rooms | 6 | 0 | 2.3 | 0.8 | 1.0 | 2.0 | 6.0 | |
| floor | 28 | 12 | 3.4 | 2.6 | 1.0 | 3.0 | 30.0 | |
| floorCount | 31 | 2 | 5.7 | 3.4 | 1.0 | 5.0 | 30.0 | |
| buildYear | 149 | 28 | 1998.8 | 29.8 | 1850.0 | 2010.0 | 2024.0 | |
| latitude | 22850 | 0 | 51.7 | 1.2 | 50.0 | 52.2 | 54.6 | |
| longitude | 23409 | 0 | 19.7 | 1.7 | 14.4 | 20.0 | 23.2 | |
| centreDistance | 1302 | 0 | 3.9 | 2.6 | 0.0 | 3.5 | 16.6 | |
| poiCount | 188 | 0 | 24.2 | 26.2 | 0.0 | 16.0 | 210.0 | |
| schoolDistance | 1668 | 0 | 0.4 | 0.3 | 0.0 | 0.3 | 4.9 | |
| clinicDistance | 3163 | 0 | 0.8 | 0.7 | 0.0 | 0.6 | 5.0 | |
| postOfficeDistance | 1879 | 0 | 0.5 | 0.4 | 0.0 | 0.4 | 4.9 | |
| kindergartenDistance | 1465 | 0 | 0.3 | 0.3 | 0.0 | 0.3 | 4.8 | |
| restaurantDistance | 1456 | 0 | 0.3 | 0.3 | 0.0 | 0.2 | 5.0 | |
| collegeDistance | 4154 | 1 | 1.3 | 1.0 | 0.0 | 1.0 | 5.0 | |
| pharmacyDistance | 1490 | 0 | 0.3 | 0.3 | 0.0 | 0.2 | 5.0 | |
| price | 996 | 0 | 3682.3 | 2186.0 | 346.0 | 3000.0 | 23000.0 | |
| N | % | |||||||
| city | bialystok | 223 | 0.6 | |||||
| bydgoszcz | 731 | 1.9 | ||||||
| czestochowa | 345 | 0.9 | ||||||
| gdansk | 1930 | 5.1 | ||||||
| gdynia | 769 | 2.0 | ||||||
| katowice | 1453 | 3.8 | ||||||
| krakow | 7366 | 19.4 | ||||||
| lodz | 2647 | 7.0 | ||||||
| lublin | 632 | 1.7 | ||||||
| poznan | 2217 | 5.8 | ||||||
| radom | 115 | 0.3 | ||||||
| rzeszow | 310 | 0.8 | ||||||
| szczecin | 719 | 1.9 | ||||||
| warszawa | 14785 | 39.0 | ||||||
| wroclaw | 3699 | 9.7 | ||||||
| type | 9190 | 24.2 | ||||||
| apartmentBuilding | 12495 | 32.9 | ||||||
| blockOfFlats | 12590 | 33.2 | ||||||
| tenement | 3666 | 9.7 | ||||||
| condominium | 37941 | 100.0 | ||||||
| buildingMaterial | 16157 | 42.6 | ||||||
| brick | 19939 | 52.6 | ||||||
| concreteSlab | 1845 | 4.9 | ||||||
| condition | 27790 | 73.2 | ||||||
| low | 123 | 0.3 | ||||||
| premium | 10028 | 26.4 | ||||||
| hasParkingSpace | no | 23159 | 61.0 | |||||
| yes | 14782 | 39.0 | ||||||
| hasBalcony | no | 14505 | 38.2 | |||||
| yes | 23436 | 61.8 | ||||||
| hasElevator | 2096 | 5.5 | ||||||
| no | 12310 | 32.4 | ||||||
| yes | 23535 | 62.0 | ||||||
| hasSecurity | no | 32234 | 85.0 | |||||
| yes | 5707 | 15.0 | ||||||
| hasStorageRoom | no | 31784 | 83.8 | |||||
| yes | 6157 | 16.2 |
A brief exploration of the dataset reveals that both the rent and buy datasets share the same 28 features, which include both categorical and continuous variables. The buy dataset contains 92 967 unique records, while the rent dataset has 37 941 unique records. An interesting observation is that the data includes not only missing values (NAs) but also different type of blank entries, which will be addressed in the next section.
# Checking the nature of "other" blanks and replacing them with NA
data_buy[5,"buildingMaterial"]
## [1] ""
data_buy[data_buy==""] <- NA
data_rent[data_rent==""] <- NA
knitr::kable(data.frame(miss_var_summary(data_buy)))
| variable | n_miss | pct_miss |
|---|---|---|
| condition | 69914 | 75.2 |
| buildingMaterial | 39179 | 42.1 |
| type | 19792 | 21.3 |
| floor | 15984 | 17.2 |
| buildYear | 15641 | 16.8 |
| hasElevator | 4448 | 4.78 |
| collegeDistance | 2492 | 2.68 |
| floorCount | 1082 | 1.16 |
| clinicDistance | 333 | 0.358 |
| restaurantDistance | 226 | 0.243 |
| pharmacyDistance | 128 | 0.138 |
| postOfficeDistance | 99 | 0.106 |
| kindergartenDistance | 85 | 0.0914 |
| schoolDistance | 60 | 0.0645 |
| city | 0 | 0 |
| squareMeters | 0 | 0 |
| rooms | 0 | 0 |
| latitude | 0 | 0 |
| longitude | 0 | 0 |
| centreDistance | 0 | 0 |
| poiCount | 0 | 0 |
| ownership | 0 | 0 |
| hasParkingSpace | 0 | 0 |
| hasBalcony | 0 | 0 |
| hasSecurity | 0 | 0 |
| hasStorageRoom | 0 | 0 |
| price | 0 | 0 |
| month | 0 | 0 |
knitr::kable(data.frame(miss_var_summary(data_rent)))
| variable | n_miss | pct_miss |
|---|---|---|
| condition | 27790 | 73.2 |
| buildingMaterial | 16157 | 42.6 |
| buildYear | 10570 | 27.9 |
| type | 9190 | 24.2 |
| floor | 4646 | 12.2 |
| hasElevator | 2096 | 5.52 |
| floorCount | 730 | 1.92 |
| collegeDistance | 490 | 1.29 |
| restaurantDistance | 82 | 0.216 |
| pharmacyDistance | 38 | 0.100 |
| clinicDistance | 32 | 0.0843 |
| kindergartenDistance | 28 | 0.0738 |
| postOfficeDistance | 19 | 0.0501 |
| schoolDistance | 11 | 0.0290 |
| city | 0 | 0 |
| squareMeters | 0 | 0 |
| rooms | 0 | 0 |
| latitude | 0 | 0 |
| longitude | 0 | 0 |
| centreDistance | 0 | 0 |
| poiCount | 0 | 0 |
| ownership | 0 | 0 |
| hasParkingSpace | 0 | 0 |
| hasBalcony | 0 | 0 |
| hasSecurity | 0 | 0 |
| hasStorageRoom | 0 | 0 |
| price | 0 | 0 |
| month | 0 | 0 |
Both data sets have some serious problems with missing data, especially in condition and building material.
Data set ‘buy’
# Upset plot of missing data and number of missings by a city - buy dataset
NA_vector_buy <- c("floor", "buildYear", "collegeDistance", "floorCount",
"clinicDistance", "restaurantDistance", "pharmacyDistance",
"postOfficeDistance", "kindergartenDistance", "schoolDistance",
"condition", "buildingMaterial", "type", "hasElevator")
gg_miss_upset(data_buy, nsets=14, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
set_size.numbers_size=TRUE, set_size.show = TRUE)
gg_miss_fct(data_buy[,c(NA_vector_buy, "city")], fct = city) +
theme(text = element_text(size = 25))
An analysis of missing values in the ‘buy’ dataset using an upset plot reveals some connections between condition, building material, and type. However, given the scale of missing values in these features, their mutual occurrence does not appear to be significant. On the other hand, in Gdańsk and Gdynia, the ‘condition’ attribute is missing more frequently than in other cities, while in Częstochowa, the ‘build year’ is notably absent more often.
Data set ‘rent’
# Upset plot of missing data and number of missings by a city - rent dataset
NA_vector_rent <- c("condition", "floor", "buildYear", "floorCount", "collegeDistance",
"clinicDistance", "restaurantDistance", "pharmacyDistance",
"postOfficeDistance", "kindergartenDistance", "schoolDistance")
gg_miss_upset(data_rent, nsets=10, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
set_size.numbers_size=TRUE, set_size.show = TRUE)
gg_miss_fct(data_rent[,c(NA_vector_rent, "city")], fct = city) +
theme(text = element_text(size = 25))
A similar pattern is observed in the ‘rent’ dataset. Once again, some connections between condition, building material, and type are visible, but they do not appear to be significant. In this case, missing condition data is noticeable not only in Gdańsk and Gdynia but also in Częstochowa, where the ‘build year’ attribute is also frequently missing.
In the next steps will be checked correlation between NAs and other variable, for this purpose we will conduct below steps:
code NAs to 1 (in case of NA) and 0
code binary, categorical variables
exclude non binar variables
# Coding NAs
NA_data_buy = data_buy
NA_data_rent = data_rent
for (i in 1:length(NA_vector_buy)) {
col_name <- paste("NA",NA_vector_buy[i], sep="_")
NA_data_buy[[col_name]] <- ifelse(is.na(NA_data_buy[[NA_vector_buy[i]]]), 1, 0)
}
for (i in 1:length(NA_vector_rent)) {
col_name <- paste("NA",NA_vector_rent[i], sep="_")
NA_data_rent[[col_name]] <- ifelse(is.na(NA_data_rent[[NA_vector_rent[i]]]), 1, 0)
}
# Coding categorical variables
NA_data_buy$condition <-
NA_data_buy$condition %>%
ordered(c("low", "premium")) %>%
as.numeric()-1
NA_data_rent$condition <-
NA_data_rent$condition %>%
ordered(c("low", "premium")) %>%
as.numeric()-1
yes_no_variables <- c("hasParkingSpace", "hasBalcony", "hasElevator",
"hasSecurity", "hasStorageRoom")
for (i in 1:length(yes_no_variables)){
NA_data_buy[[yes_no_variables[i]]] <-
NA_data_buy[[yes_no_variables[i]]] %>%
ordered(c("no", "yes")) %>%
as.numeric()-1
NA_data_rent[[yes_no_variables[i]]] <-
NA_data_rent[[yes_no_variables[i]]] %>%
ordered(c("no", "yes")) %>%
as.numeric()-1
}
# vector of features to exclude
exclude_HM <- c("city", "type", "ownership", "buildingMaterial", "month",
"condition", "hasElevator")
# Heat maps
selected_features1 <- names(NA_data_buy)[startsWith(names(NA_data_buy), "NA_")]
selected_features2 <- names(NA_data_buy)[!startsWith(names(NA_data_buy), "NA_")]
NA_data_buy[,!names(NA_data_buy) %in% exclude_HM] %>%
cor(use = "pairwise.complete.obs") %>%
round(digits=2) %>%
melt() %>%
filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
ggplot(aes(x=Var1, y=Var2, fill=value)) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Correlation matrix with NA - 'buy' dataset") +
scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0)+
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(text = element_text(size = 25))
selected_features1 <- names(NA_data_rent)[startsWith(names(NA_data_rent), "NA_")]
selected_features2 <- names(NA_data_rent)[!startsWith(names(NA_data_rent), "NA_")]
NA_data_rent[, !names(NA_data_rent) %in% exclude_HM] %>%
cor(use = "pairwise.complete.obs") %>%
round(digits = 2) %>%
melt() %>%
filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
ggplot(aes(x = Var1, y = Var2, fill = value)) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Correlation matrix with NA - 'rent' dataset") +
scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
theme(text = element_text(size = 25))
Correlation matrix doesn’t show any strong dependencies between missing and other variables
For finding outliers we used couple methods:
Grubbs’ test
boxplots
Interqunatile Range Method - Q-Q plot
We paid special attention to price since this is the subject of our research. Grubbs’ test allows us defining features with potential outliers:
categorical_var <- c("city", "type", "ownership", "buildingMaterial", "condition", "hasParkingSpace", "hasBalcony", "hasElevator", "hasSecurity", "hasStorageRoom", "id", "month")
grubbs_test_buy <-
sapply(data_buy[,!names(data_buy) %in% categorical_var], grubbs.test)
var_buy <- cbind(grubbs_test_buy["p.value",][grubbs_test_buy["p.value",] < 0.05])
knitr::kable(var_buy, caption = "Variables with p-value lower than 5% (buy data set)", col.names = c("Variable", "p-value"))
| Variable | p-value |
|---|---|
| floor | 0 |
| floorCount | 1.00849896211841e-07 |
| poiCount | 0 |
| schoolDistance | 0 |
| postOfficeDistance | 0 |
| kindergartenDistance | 0 |
| restaurantDistance | 0 |
| pharmacyDistance | 0 |
| price | 2.03623467800451e-05 |
grubbs_test_rent <-
sapply(data_rent[,!names(data_rent) %in% categorical_var], grubbs.test)
var_rent <- cbind(grubbs_test_rent["p.value",][grubbs_test_rent["p.value",] < 0.05])
knitr::kable(var_rent, caption = "Variables with p-value lower than 5% (rent data set)", col.names = c("Variable", "p-value"))
| Variable | p-value |
|---|---|
| floor | 0 |
| floorCount | 2.41967365033346e-08 |
| buildYear | 0.0077240047029381 |
| centreDistance | 0.0210443044676376 |
| poiCount | 2.45829663292341e-08 |
| schoolDistance | 0 |
| clinicDistance | 0.000174023317614402 |
| postOfficeDistance | 0 |
| kindergartenDistance | 0 |
| restaurantDistance | 0 |
| pharmacyDistance | 0 |
| price | 0 |
The strongest outliers are suspected in floor, price, and proximity-related variables (schoolDistance, postOfficeDistance, etc.), suggesting unusually high or low values in these attributes. Outliers in rental prices could indicate luxury apartments or extremely cheap listings. Floor-related outliers might suggest penthouse apartments or ground-floor units with unusual characteristics. The rental data set has more variables with outlier candidates.
Outliers visualization for the ‘buy’ data set
# boxplots buy
bp1 <- ggplot(pivot_longer(data_buy[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) +
geom_boxplot() +
ggtitle("Public destination distance outliers") +
labs(x="", y="Miles") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))+
scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
theme(text = element_text(size = 25))
bp2 <- ggplot(pivot_longer(data_buy[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) +
geom_boxplot() +
ggtitle("Floor and floor count outliers") +
labs(x="", y="Count") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
bp3 <- ggplot(pivot_longer(data_buy[c("poiCount")], cols="poiCount"), aes(name, value)) +
geom_boxplot() +
ggtitle("PoI count outliers") +
labs(x="", y="Count") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
bp4 <- ggplot(pivot_longer(data_buy[c("price")], cols="price"), aes(name, value)) +
geom_boxplot() +
ggtitle("Price outliers") +
labs(x="", y="PLN") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)
Outliers visualization for the ‘rent’ data set
# boxplots rent
bp1 <- ggplot(pivot_longer(data_rent[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) +
geom_boxplot() +
ggtitle("Public destination distance outliers") +
labs(x="", y="Miles") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))+
scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
theme(text = element_text(size = 25))
bp2 <- ggplot(pivot_longer(data_rent[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) +
geom_boxplot() +
ggtitle("Floor and floor count outliers") +
labs(x="", y="Count") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
bp3 <- ggplot(pivot_longer(data_rent[c("poiCount")], cols="poiCount"), aes(name, value)) +
geom_boxplot() +
ggtitle("PoI count outliers") +
labs(x="", y="Count") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
bp4 <- ggplot(pivot_longer(data_rent[c("price")], cols="price"), aes(name, value)) +
geom_boxplot() +
ggtitle("Price outliers") +
labs(x="", y="PLN") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)
Some apartments are located in remote areas with poor access to these public facilities. Outliers in floor and floorCount variables indicate penthouses or very high-floor apartments in skyscrapers. High value of PoI suggests that some apartments are located in highly urbanized areas with dense amenities, while high prices indicate luxury apartments or highly premium properties. The boxplots exhibit a dense concentration of outliers, with no observations noticeably separated from the rest. Above mentioned suggests that the detected outliers are not due to erroneous data but rather a result of a fat-tailed distribution in the given features.
# Interqunatile Range Method for price outliers
buy_Q1 <- quantile(data_buy$price, 0.25, na.rm = TRUE)
buy_Q3 <- quantile(data_buy$price, 0.75, na.rm = TRUE)
buy_IQR_value <- buy_Q3 - buy_Q1
buy_lower_bound <- max(0, buy_Q1 - 1.5 * buy_IQR_value) # Adjust lower bound to 0, because price cannot be negative
buy_upper_bound <- buy_Q3 + 1.5 * buy_IQR_value
rent_Q1 <- quantile(data_rent$price, 0.25, na.rm = TRUE)
rent_Q3 <- quantile(data_rent$price, 0.75, na.rm = TRUE)
rent_IQR_value <- rent_Q3 - rent_Q1
rent_lower_bound <- max(0, rent_Q1 - 1.5 * rent_IQR_value) # Adjust lower bound to 0, because price cannot be negative
rent_upper_bound <- rent_Q3 + 1.5 * rent_IQR_value
# Identify price outliers
buy_outliers <-
data_buy[data_buy$price < buy_lower_bound | data_buy$price > buy_upper_bound,]
rent_outliers <-
data_rent[data_rent$price < rent_lower_bound | data_rent$price > rent_upper_bound,]
# See how many values are outliers related to the original dataset
print(paste("Price outliers related to the original dataset - buy ", as.character(round((length(buy_outliers$price)/length(data_buy$price)) * 100, 2)), "%", sep=""))
## [1] "Price outliers related to the original dataset - buy 6.09%"
print(paste("Price outliers related to the original dataset - rent ", as.character(round(length(rent_outliers$price)/length(data_rent$price) * 100, 2)), "%", sep=""))
## [1] "Price outliers related to the original dataset - rent 7.57%"
# Taking a closer look at the rent outliers
knitr::kable(data_rent[data_rent$price == min(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])
| city | price | squareMeters | poiCount | |
|---|---|---|---|---|
| 30685 | katowice | 346 | 30.36 | 4 |
knitr::kable(data_rent[data_rent$price == max(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])
| city | price | squareMeters | poiCount | |
|---|---|---|---|---|
| 32075 | warszawa | 23000 | 148.0 | 62 |
| 33121 | warszawa | 23000 | 147.6 | 60 |
| 33521 | warszawa | 23000 | 126.0 | 100 |
While 346 PLN for rent may seem low, there are multiple observations close to this price. There are 3 observations with price of 23 000 PLN, all located in the center of Warsaw, where there is a high density of points of interest (POI).
# Taking a closer look at the buy outlier
knitr::kable(data_buy[data_buy$price == min(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])
| city | price | squareMeters | poiCount | |
|---|---|---|---|---|
| 18202 | bydgoszcz | 150000 | 31.02 | 8 |
| 18206 | bydgoszcz | 150000 | 35.94 | 8 |
knitr::kable(data_buy[data_buy$price == max(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])
| city | price | squareMeters | poiCount | |
|---|---|---|---|---|
| 79943 | warszawa | 3250000 | 133.51 | 9 |
| 80351 | warszawa | 3250000 | 136.00 | 2 |
| 84104 | warszawa | 3250000 | 133.51 | 13 |
| 129837 | warszawa | 3250000 | 131.70 | 26 |
For apartments listed for sale, neither a price of 1,500,000 PLN nor 3,250,000 PLN appears alarming or unusual. There are multiple listings with prices close to these values, suggesting they are within a reasonable range.
An additional confirmation of the price distribution would be a Q-Q plot. Since apartment prices are typically log-normally distributed, a Q-Q plot of log-transformed rent prices would help assess their conformity to this distribution.
qqnorm(log(data_buy$price), main = "Normal Q-Q Plot ('buy' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_buy$price))
qqnorm(log(data_rent$price), main = "Normal Q-Q Plot ('rent' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_rent$price))
In our case distributions are fat-tailed.
Data validation ensures data correctness by verifying logical consistency and identifying anomalies. In this data set, we checked the following properties:
Floor number is less than or equal to the total floor count.
Build year is less than 2024.
Latitude and longitude fall within Poland’s geographic range.
# Data validation
RULE <- editfile("RULES")
violated_buy <- violatedEdits(RULE, data_buy)
# summary(violated_buy)
RULE <- editfile("RULES")
violated_rent <- violatedEdits(RULE, data_rent)
# summary(violated_rent)
Since there are no incorrect data or outliers, we can proceed with imputing missing values using the MICE method, specifically the classification and regression trees (CART) variation. First, we encode text and categorical variables to make them suitable for imputation. Next, we perform the imputation, verify its success, and finally decode the variables back to their original format.
To avoid retraining the model each time, we will save the imputation results and disable the relevant cells from running in every Markdown execution.
# coding data - buy dataset
data_buy$condition <-
data_buy$condition %>%
ordered(c("low", "premium")) %>%
as.numeric()-1
data_buy$buildingMaterial <-
data_buy$buildingMaterial %>%
ordered(c("brick", "concreteSlab")) %>%
as.numeric()-1
data_buy$type <-
data_buy$type %>%
ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>%
as.numeric()-1
data_buy$hasElevator <-
data_buy$hasElevator %>%
ordered(c("no", "yes")) %>%
as.numeric()-1
# MICE imputation
imp_buy <- mice(data_buy, m=3, maxit=3, method="cart")
# Pool results into final dataset
complete_data_buy <- complete(imp_buy)
# evaluating imputation results
data.frame(miss_var_summary(complete_data_buy))
Data was successfully imputed
# decoding variables
complete_data_buy$condition <- ifelse(complete_data_buy$condition == 0, "low", "premium")
complete_data_buy$buildingMaterial <- ifelse(complete_data_buy$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_buy$type <- ifelse(complete_data_buy$type == 0, "apartmentBuilding",
ifelse(complete_data_buy$type == 1, "blockOfFlats", "tenement"))
complete_data_buy$hasElevator <- ifelse(complete_data_buy$hasElevator == 0, "no", "yes")
# saving imputed data
write.csv(complete_data_buy,"clean_buy.csv")
# coding vriables - rent dataset
data_rent$condition <-
data_rent$condition %>%
ordered(c("low", "premium")) %>%
as.numeric()-1
data_rent$buildingMaterial <-
data_rent$buildingMaterial %>%
ordered(c("brick", "concreteSlab")) %>%
as.numeric()-1
data_rent$type <-
data_rent$type %>%
ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>%
as.numeric()-1
data_rent$hasElevator <-
data_rent$hasElevator %>%
ordered(c("no", "yes")) %>%
as.numeric()-1
# Multiple data imputation
imp_rent <- mice(data_rent, m=3, maxit=3, method="cart")
# Pool results into final dataset
complete_data_rent <- complete(imp_rent)
# evaluating imputation
data.frame(miss_var_summary(complete_data_rent))
# decoding variables
complete_data_rent$condition <- ifelse(complete_data_rent$condition == 0, "low", "premium")
complete_data_rent$buildingMaterial <- ifelse(complete_data_rent$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_rent$type <- ifelse(complete_data_rent$type == 0, "apartmentBuilding",
ifelse(complete_data_rent$type == 1, "blockOfFlats", "tenement"))
complete_data_rent$hasElevator <- ifelse(complete_data_rent$hasElevator == 0, "no", "yes")
# saving imputed data
write.csv(complete_data_rent,"clean_rent.csv")
complete_data_buy <- read.csv("clean_buy.csv")
complete_data_buy <- complete_data_buy %>%
select(-X)
complete_data_rent <- read.csv("clean_rent.csv")
complete_data_rent <- complete_data_rent %>%
select(-X)
# or
# complete_data_buy <- complete_data_buy
# complete_data_rent <- complete_data_rent
total_median_buy <- median(complete_data_buy$price)
plot_median_buy <- complete_data_buy %>%
group_by(city) %>%
summarise(median_price = median(price)) %>%
ggplot(aes(x = reorder(city, median_price),
y = median_price,
fill = median_price > total_median_buy)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Median Purchase Price by City",
x = "City",
y = "Median Price") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
total_median_rent <- median(complete_data_rent$price)
plot_median_rent <- complete_data_rent %>%
group_by(city) %>%
summarise(median_price = median(price)) %>%
ggplot(aes(x = reorder(city, median_price),
y = median_price,
fill = median_price > total_median_rent)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Median Rent Price by City",
x = "City",
y = "Median Price") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(plot_median_buy, plot_median_rent, ncol = 2)
There are highest prices for housing purchasing in the largest Polish cities as it follows from the graph, where in red marked cities with median prices above overall median price. The highest prices are in Warsaw, Krakow and Gdansk, while the cheapest housings are in Czestochowa, Radom and Bydgoszcz. The rent prices are a little more close in terms of the median values, nevertheless there is an outlier - capital city Warsaw, where the median is over 30% higher than in other cities.
Purchase data set
ggplot(complete_data_buy, aes(x = squareMeters, y = price)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", color = "red") +
facet_wrap(~city) +
labs(title = "Appartemnt Price vs Size by City") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
Buy data set
ggplot(complete_data_rent, aes(x = squareMeters, y = price)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", color = "red") +
facet_wrap(~city) +
labs(title = "Appartemnt Price vs Size by City") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
Buy data set
complete_data_buy %>%
gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
ggplot(aes(x = has_amenity, y = price)) +
geom_boxplot(fill = "lightblue") +
facet_wrap(~amenity) +
labs(title = "Price Distribution by Amenities") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
Rental data set
complete_data_rent %>%
gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
ggplot(aes(x = has_amenity, y = price)) +
geom_boxplot(fill = "lightblue") +
facet_wrap(~amenity) +
labs(title = "Price Distribution by Amenities") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
The presence of amenities increases the price of housing, while there is an outlier amenity - storage room: the median price for housings with a storage room is lower than without it. The difference in price for the categories is plotted on the next graph.
pl1 <- complete_data_buy %>%
gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
group_by(amenity) %>%
mutate(
avg_no = mean(price[has_amenity == "no"]),
avg_yes = mean(price[has_amenity == "yes"]),
pct_diff = (avg_yes - avg_no) / avg_no * 100
) %>%
ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
geom_col() +
scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
coord_flip() +
labs(
title = "Purchase data",
x = "Amenity",
y = "Price Difference (%)",
fill = "Price Impact"
) +
theme_minimal() +
scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
pl2 <- complete_data_rent %>%
gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
group_by(amenity) %>%
mutate(
avg_no = mean(price[has_amenity == "no"]),
avg_yes = mean(price[has_amenity == "yes"]),
pct_diff = (avg_yes - avg_no) / avg_no * 100
) %>%
ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
geom_col() +
scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
coord_flip() +
labs(
title = "Rental data",
x = "Amenity",
y = "Price Difference (%)",
fill = "Price Impact"
) +
theme_minimal() +
scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Price Premium for Amenities",gp=gpar(fontsize=35)))
Amenities (all but storageRoom) seem to have a noticeable positive impact on price of the apartment. A lower purchase price for housings with storage rooms may be caused by hidden additional costs, which means that buyers are forced to pay extra for a storage room. The phenomena requires further analysis with broader range of data.
Purchase data set
distance_vars_buy <- select(complete_data_buy, contains("Distance"), price)
correlation_matrix_buy <- cor(distance_vars_buy, use = "complete.obs")
corrplot(correlation_matrix_buy, method = "color", addCoef.col = 1,
tl.cex = 1.8, number.cex = 1.8)
There is no strong correlation found between the level of prices and distance to public places. However there is pretty solid positive correlation between variables of distances to public places.
Rental data set
distance_vars_rent <- select(complete_data_rent, contains("Distance"), price)
correlation_matrix_rent <- cor(distance_vars_rent, use = "complete.obs")
corrplot(correlation_matrix_rent, method = "color", addCoef.col = 1,
tl.cex = 1.8, number.cex = 1.8)
Similar trends are observed on a rent market data set.
plot_age_buy <- ggplot(complete_data_buy, aes(x = buildYear, y = price)) +
geom_smooth() +
labs(title = "Purchase data") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
plot_age_rent <- ggplot(complete_data_rent, aes(x = buildYear, y = price)) +
geom_smooth() +
labs(title = "Rental data") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(plot_age_buy+theme(legend.position="top"),
plot_age_rent+theme(legend.position="top"),
ncol = 2,
top =textGrob("Price vs Building Age",gp=gpar(fontsize=35)))
Interesting things can be observed – model is less certain about the smooth price for apartments for rent built in the 70s, which might suggest there’s less of them available for rent. This phenomena is correlated with modern market rules, as the demand on the new apartments is higher. The apartments in very old buildings tend to be very expensive, but there are not many of them. Commie blocks are the least expensive for both rent and purchase. Out of modern buildings those built around 2508 seem to be the most expensive. The prices for the newest buildings seem to be falling with each year.
pl1 <- ggplot(complete_data_buy, aes(x = condition, y = price)) +
geom_violin(fill = "beige") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Purchase data") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
pl2 <- ggplot(complete_data_rent, aes(x = condition, y = price)) +
geom_violin(fill = "beige") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Rental data") +
scale_y_continuous(labels = label_comma()) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Price Distribution by Property Condition",gp=gpar(fontsize=35)))
There is no significant influence of housing condition on the price level, nevertheless its visible that premium housings are a little bit more expensive. It can be noticed that the figure of premium condition has a more curvy shape, which can be explained by larger number of observation.
data_buy <- complete_data_buy
data_rent <- complete_data_rent
data_buy <- data_buy %>%
mutate(price_per_m2 = price / squareMeters)
data_rent <- data_rent %>%
mutate(price_per_m2 = price / squareMeters)
print(dfSummary(data_buy, na.col=FALSE, headings=FALSE, valid.col=FALSE),
method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | city [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | type [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | squareMeters [numeric] |
|
7173 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | rooms [integer] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | floor [integer] |
|
27 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | floorCount [integer] |
|
29 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | buildYear [integer] |
|
165 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | latitude [numeric] |
|
46449 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | longitude [numeric] |
|
48233 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | centreDistance [numeric] |
|
1487 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | poiCount [integer] |
|
196 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | schoolDistance [numeric] |
|
2505 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | clinicDistance [numeric] |
|
4263 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | postOfficeDistance [numeric] |
|
2689 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | kindergartenDistance [numeric] |
|
2309 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | restaurantDistance [numeric] |
|
2406 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | collegeDistance [numeric] |
|
4817 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | pharmacyDistance [numeric] |
|
2420 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | ownership [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | buildingMaterial [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | condition [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | hasParkingSpace [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | hasBalcony [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | hasElevator [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | hasSecurity [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | hasStorageRoom [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | price [integer] |
|
7287 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | month [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | price_per_m2 [numeric] |
|
47186 distinct values |
Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10
Summary table contains information about every column in the purchase data set. It can be noticed from the table that
print(dfSummary(data_rent, na.col=FALSE, headings=FALSE, valid.col=FALSE),
method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | city [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | type [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | squareMeters [numeric] |
|
3260 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | rooms [integer] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | floor [integer] |
|
27 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | floorCount [integer] |
|
30 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | buildYear [integer] |
|
148 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | latitude [numeric] |
|
22850 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | longitude [numeric] |
|
23409 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | centreDistance [numeric] |
|
1302 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | poiCount [integer] |
|
188 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | schoolDistance [numeric] |
|
1667 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | clinicDistance [numeric] |
|
3162 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | postOfficeDistance [numeric] |
|
1878 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | kindergartenDistance [numeric] |
|
1464 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | restaurantDistance [numeric] |
|
1455 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | collegeDistance [numeric] |
|
4153 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | pharmacyDistance [numeric] |
|
1489 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | ownership [character] | 1. condominium |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | buildingMaterial [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | condition [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | hasParkingSpace [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | hasBalcony [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | hasElevator [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | hasSecurity [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | hasStorageRoom [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | price [integer] |
|
996 distinct values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | month [character] |
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | price_per_m2 [numeric] |
|
9600 distinct values |
Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10
Rental market data in general follow the same trend, however there are some differences:
# range(data_buy$price)
limits <- seq(min(data_buy$price)-500000, max(data_buy$price)+500000, by=100000)
limits_factor <- cut(data_buy$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))
hist(data_buy$price, prob=TRUE, breaks=limits, main="Price groupping for purchase market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_buy$price), col=2)
print("TAI (tabular accuracy index)")
## [1] "TAI (tabular accuracy index)"
tab_accuracy <- classIntervals(data_buy$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)
## # classes Goodness of fit Tabular accuracy
## 41.0000000 0.9947621 0.9148005
pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)
The price has skewed distribution distribution. A histogram was build to show how many housings fall in certain price levels from minimum price 150000 up to the maximum 3250000 by 100000 PLN. The histogram shows that most represented category is 600-700 thousands PLN. The partition accuracy was checked with the tabular accuracy index, which shows high tabular accuracy and goodness of fit.
# range(data_rent$price)
limits <- seq(min(data_rent$price) - 300, max(data_rent$price) + 300, by=500)
limits_factor <- cut(data_rent$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))
hist(data_rent$price, prob=TRUE, breaks=limits, main="Price groupping for rental market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_rent$price), col=2)
print("TAI (tabular accuracy index)")
## [1] "TAI (tabular accuracy index)"
tab_accuracy <- classIntervals(data_rent$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)
## # classes Goodness of fit Tabular accuracy
## 46.0000000 0.9959035 0.9154236
pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)
The shapes of the rental market distribution is narrower than purchase market one, which can be explained by the target groups of these offers (often these group cannot afford to buy a housing).
pl1 <- data_buy %>%
ggplot(aes(x=price_per_m2)) +
geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
annotate("text", x=12100, y=0.00004, label=paste("Median: ", as.character(median(round(data_buy$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
ggtitle("Purchase market") +
labs(x="PLN per square meter", y="Density") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
pl2 <- data_rent %>%
ggplot(aes(x=price_per_m2)) +
geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
annotate("text", x=57, y=0.01, label=paste("Median: ", as.character(median(round(data_rent$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
ggtitle("Rental market") +
labs(x="PLN per square meter", y="Density") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Price per square meter",gp=gpar(fontsize=35)))
This image shows density plots for the price per square meter in both the purchase and rental markets. The distribution graphs of prices per square meter are less skewed than normal prices and additionally:
Purchase Market
The distribution appears right-skewed (positively skewed), meaning more properties have prices below the median, with some high-priced outliers.
The median price per square meter is 13,322.98 PLN.
There’s a peak around 10,000-15,000 PLN, suggesting a common price range.
Rental Market
Also right-skewed, but with a sharper peak and a steeper decline.
The median rental price per square meter is 65.67 PLN.
Most rentals are concentrated in the 40-80 PLN per square meter range.
type_across_cities <- table(data_buy$city, data_buy$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>%
kable() %>%
kable_styling()
| apartmentBuilding | blockOfFlats | tenement | Sum | |
|---|---|---|---|---|
| bialystok | 11.50 | 85.25 | 3.24 | 99.99 |
| bydgoszcz | 8.77 | 65.95 | 25.28 | 100.00 |
| czestochowa | 7.33 | 85.35 | 7.33 | 100.01 |
| gdansk | 19.30 | 64.85 | 15.85 | 100.00 |
| gdynia | 20.62 | 67.39 | 11.99 | 100.00 |
| katowice | 19.40 | 56.86 | 23.74 | 100.00 |
| krakow | 20.05 | 65.68 | 14.27 | 100.00 |
| lodz | 15.07 | 69.70 | 15.23 | 100.00 |
| lublin | 3.48 | 87.12 | 9.40 | 100.00 |
| poznan | 14.87 | 64.91 | 20.23 | 100.01 |
| radom | 3.08 | 89.23 | 7.69 | 100.00 |
| rzeszow | 12.73 | 82.94 | 4.33 | 100.00 |
| szczecin | 8.04 | 62.08 | 29.88 | 100.00 |
| warszawa | 26.58 | 60.10 | 13.32 | 100.00 |
| wroclaw | 24.76 | 55.13 | 20.12 | 100.01 |
The table shows percentage of housing of the certain type per city. There are cities with higher percentage of apartment buildings (Gdansk, Gdynia, Krakow, Warszawa and Wroclaw), which are characterized by greater market share and higher prices, while in smaller cities with lower prices relative quantity of offers in blocks of flats is higher (Bialystok, Czestohchowa, Lublina, Radom). Nonetheless there is a common trend for each of the cities: the majority of housings is offered in blocks of flats, while apartments of buildings and tenements make up a smaller group of housings.
type_across_cities <- table(data_rent$city, data_rent$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>%
kable() %>%
kable_styling()
| apartmentBuilding | blockOfFlats | tenement | Sum | |
|---|---|---|---|---|
| bialystok | 26.01 | 70.40 | 3.59 | 100.00 |
| bydgoszcz | 21.07 | 57.73 | 21.20 | 100.00 |
| czestochowa | 16.52 | 72.17 | 11.30 | 99.99 |
| gdansk | 38.86 | 51.71 | 9.43 | 100.00 |
| gdynia | 26.92 | 61.51 | 11.57 | 100.00 |
| katowice | 39.99 | 48.80 | 11.22 | 100.01 |
| krakow | 41.35 | 46.17 | 12.48 | 100.00 |
| lodz | 45.86 | 45.67 | 8.46 | 99.99 |
| lublin | 19.30 | 67.25 | 13.45 | 100.00 |
| poznan | 31.75 | 56.52 | 11.73 | 100.00 |
| radom | 20.00 | 72.17 | 7.83 | 100.00 |
| rzeszow | 28.71 | 66.13 | 5.16 | 100.00 |
| szczecin | 27.96 | 59.67 | 12.38 | 100.01 |
| warszawa | 49.34 | 36.02 | 14.64 | 100.00 |
| wroclaw | 45.61 | 45.07 | 9.33 | 100.01 |
The situation is somewhat different for the rental market. There is an increase in apartment buildings type offers resulting in an incline in percentage of block of flats. Quantity of offers in tenements is also reduced insignificantly.
| type | mean | sd | median | min | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|
| apartmentBuilding | 16835.35 | 4883.377 | 16614.36 | 4137.931 | 32096.77 | 0.3299874 | 2.672051 |
| blockOfFlats | 12820.99 | 4233.305 | 12531.38 | 3435.864 | 30935.48 | 0.5797487 | 3.329609 |
| tenement | 13848.69 | 5904.646 | 12983.91 | 3000.000 | 31041.67 | 0.4951448 | 2.487640 |
The most expensive in terms of price per square meter are housings in apartment buildings. Their average price is 26.8% more for blocks of flats and 25.1% for tenements in purchasing. The smallest standard deviation** value for blocks of flats in both markets can be explained by average characteristics of this type of housing: locations and conditions are close in different blocks of flats or apartments, while there are old, cheaper tenements and relatively small amount of the new ones in perfect condition and location so the deviation is significantly larger. Positive values of skewness indicate the tail of the distribution is on the right, which means the amount of more expensive apartments is fading more gradually than the cheaper ones. Kurtosis indicates quite peaked distributions for each category.
| type | mean | sd | median | min | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|
| apartmentBuilding | 75.76773 | 23.22751 | 72.37607 | 11.30396 | 187.8378 | 1.1240927 | 5.142059 |
| blockOfFlats | 61.56362 | 19.15648 | 59.25926 | 11.29861 | 189.4737 | 0.9502302 | 5.420271 |
| tenement | 70.01260 | 24.04812 | 67.30769 | 11.29971 | 189.3146 | 0.7050550 | 3.895923 |
Apartments in apartment buildings have the highest average rent (75.77 PLN/m²) and median (72.38 PLN/m²), suggesting they tend to be more expensive. Block of flats has the lowest average (61.56 PLN/m²) and median (59.26 PLN/m²), meaning they are generally more affordable, while tenements are in between as it was for purchasing data set. Standard deviation is highest for tenements (24.05 PLN/m²) and apartment buildings (23.23 PLN/m²), indicating a wider spread in rental prices. The maximum rental price is quite high for all categories (~189 PLN/m²), suggesting some luxury or premium properties significantly impact the upper range. Apartment buildings (5.14) and block of flats (5.42) have higher kurtosis, indicating a sharper peak with more extreme values. Tenements (3.90) have a more moderate kurtosis, meaning a more balanced distribution without extreme outliers.
mu_buy <- ddply(data_buy, "type", summarise, grp.mean=mean(price_per_m2))
pl1 <- ggplot(data_buy, aes(x=price_per_m2, color=type)) +
geom_density(size=1.5) +
theme(legend.position="top") +
geom_vline(data=mu_buy, aes(xintercept=grp.mean, color=type), linetype="dashed") +
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) +
theme_classic() +
labs(x="PLN per square meter", y="Density", subtitle = "Purchase market") +
theme() +
theme(text = element_text(size = 25))
mu_rent <- ddply(data_rent, "type", summarise, grp.mean=mean(price_per_m2))
pl2 <- ggplot(data_rent, aes(x=price_per_m2, color=type)) +
geom_density(size=1.5) +
geom_vline(data=mu_rent, aes(xintercept=grp.mean, color=type), linetype="dashed") +
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) +
theme_classic() +
labs(x="PLN per square meter", y="Density", subtitle = "Rental market") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Distribution of price per square meter by housing type",gp=gpar(fontsize=35))
)
This image presents density plots for the distribution of price per square meter in both the purchase and rental markets, categorized by housing type (apartment buildings, blocks of flats and tenements). Purchase prices are more spread out than rental prices, particularly for apartment buildings. Rental prices are relatively more concentrated, with fewer extreme values. Blocks of flats consistently show the lowest prices and least variation, making them the most affordable option. Apartment buildings are the most expensive, both in purchase and rental markets, with more high-priced outliers.
pl1 <- ggplot(data_buy, aes(x=type, fill=buildingMaterial))+
geom_bar(stat="count", width=0.95) +
scale_fill_manual(values=c("darkred", "grey")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
pl2 <-ggplot(data_rent, aes(x=type, fill=buildingMaterial))+
geom_bar(stat="count", width=0.95) +
scale_fill_manual(values=c("darkred", "grey")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Housing type vs building material",gp=gpar(fontsize=35))
)
The majority of housings is build of brick, while this material is used more frequently in apartment buildings and tenements than in blocks of flats, which are made of concrete slab in one third of cases. The trend is common for purchasing and rental markets.
Statistical tests are conducted to determine whether observed data patterns are due to chance or represent real effects. They help in making objective decisions, validating hypotheses, and drawing conclusions in research by measuring relationships, differences, or associations within data sets.
The main purpose of the statistical test we are going to perform is to determine which variables have a statistically significant impact on both rent and buy prices.
palette_colors <- viridis(length(unique(complete_data_buy$city)))
complete_data_buy %>%
ggbetweenstats(
y = price,
x = city,
bf.message = FALSE,
pairwise.display = "none",
title = "Distribution of buy price across cities - buy dataset",
palette = "default_ucscgb",
package = "ggsci",
centrality.label.args = list(size = 7)
) +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))
complete_data_rent %>%
ggbetweenstats(
y=price,
x=city,
bf.message = FALSE,
pairwise.display = "none",
title = "Distribution of rent price across cities - rent dataset",
palette = "default_ucscgb",
package = "ggsci",
centrality.label.args = list(size = 7)
) +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))
Based on Welch’s test, we obtained a p-value close to 0, so we reject the null hypothesis of equal means between populations in both the buy and rent data sets. This means that the city statistically significantly differentiates both rent and buy prices. However, an effect size of 0.86 implies that this differentiation has a moderate magnitude.
amentities <- c("hasBalcony", "hasElevator", "hasParkingSpace", "hasStorageRoom", "hasSecurity")
lst <- list()
for (i in amentities) {
lst[[i]] <-
ggbetweenstats(complete_data_buy,
y = price,
x = !!sym(i),
bf.message = FALSE,
palette = "default_ucscgb",
package = "ggsci",
title = i,
centrality.label.args = list(size = 7),
ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) +
xlab("")
}
grid.arrange(grobs=lst,
ncol = 3,
nraw = 2,
top =textGrob("Buy Price Distribution by amenities",gp=gpar(fontsize=35))
)
All amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.14 for a balcony to -0.57 for an elevator.
for (i in amentities) {
lst[[i]] <-
ggbetweenstats(complete_data_buy,
y = price,
x = !!sym(i),
bf.message = FALSE,
palette = "default_ucscgb",
package = "ggsci",
title = i,
centrality.label.args = list(size = 7),
ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) +
xlab("")
}
grid.arrange(grobs=lst,
ncol = 3,
nraw = 2,
top =textGrob("Rent Price Distribution by amenities",gp=gpar(fontsize=35))
)
Similarly to buy prices, all amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.10 for a balcony to -0.31 for an elevator.
pl1 <- ggbetweenstats(complete_data_buy,
y = price,
x = condition,
bf.message = FALSE,
palette = "default_ucscgb",
package = "ggsci",
title = "Buy data set",
centrality.label.args = list(size = 10),
ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))
pl2 <- ggbetweenstats(complete_data_rent,
y = price,
x = condition,
bf.message = FALSE,
palette = "default_ucscgb",
package = "ggsci",
title = "Rental data set",
centrality.label.args = list(size = 10),
ggplot.component = list(theme(plot.subtitle = element_text(size = 12, face = "bold")))
) +
theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) +
ylab("")
grid.arrange(pl1+theme(legend.position="top"),
pl2+theme(legend.position="top"),
ncol = 2,
top =textGrob("Price difference by condition",gp=gpar(fontsize=35))
)
Finally, condition of flat in both rent and buy data sets also seem to differentiate price significantly but has only moderate magnitude.
This project provides a comprehensive analysis of the real estate market, utilizing statistical methods and visualizations to uncover key trends and insights. The study examines data structure, performs data cleansing, and addresses outliers to ensure accuracy. Through descriptive analysis and summary statistics, the project identifies variations in market share across cities and explores price distribution patterns.
The following problems were explored among others:
A data-driven overview of real estate pricing trends.
Identification of key factors influencing property prices.
Insights into how housing types and amenities impact pricing.
Statistical validation of differences in pricing across various cities.
Certain cities (e.g., Gdańsk, Gdynia, Częstochowa) show significant missing data in condition and build year attributes, however there is no strong correlation between missing values and other features. The highest property prices are in Warsaw, Krakow, and Gdansk, while the cheapest cities are Czestochowa, Radom, and Bydgoszcz. Property prices follow a fat-tailed distribution, while high-priced properties are concentrated in city centers with dense places of interest (PoI). No strong correlation is found between price levels and proximity to public places, but distances between amenities show a strong relationship.
Rental and purchase price distributions by cities, amenities and condition differ significantly, as confirmed by Welch’s test (p-value ≈ 0). Rental prices are more uniform, except in Warsaw, where the median rent is 30% higher than in other cities. A positive correlation exists between apartment size and price, with larger price increases in cities with higher property costs. Amenities generally increase housing prices, except for storage rooms, which correlate with lower prices, but the effect size is small (e.g., -0.14 for balconies, -0.57 for elevators). Housing condition has a moderate impact on prices, but its effect size is limited.
The study confirms that city location, property size, and amenities significantly influence real estate prices. While rental prices show more uniformity, purchase prices exhibit stronger variations based on property type and city. Warsaw dominates both markets, with modern buildings losing value over time. Statistical tests validate these findings, providing insights for investors, buyers, and policymakers.