Introduction

The real estate market plays a crucial role in economic growth, urban development, and investment strategies. Understanding its structure, pricing trends, and key influencing factors is essential for stakeholders, including buyers, investors, and policymakers. This project provides an in-depth analysis of the real estate market using statistical methods and data visualization techniques to extract meaningful insights from the available data.

The following milestones were outlined during design process:

  1. Explatory Analysis: Understanding the format, types, and sources of data.

  2. Data Cleansing and Wrangling: Ensuring data accuracy, handling missing values, and standardizing formats.

  3. Outliers: Identifying and analyzing unusual price points that may affect overall trends.

  4. Visualizations: Graphical representations to facilitate pattern recognition and trend analysis.

  5. Descriptive Analysis: Exploring key metrics like mean, median, variance, and skewness, recovered aggregated statistical insights that describe the market dynamics.

  6. Price Distribution Shape: Analyzing the distribution of prices through density plots and statistical summaries.

  7. Statistical Tests: Assessing the statistical significance of median price differences across cities and investigating the impact of amenities on price variations.

Objectives

  • Provide a clear understanding of real estate price structures.

  • Identify significant price differentiators and market trends.

  • Utilize statistical techniques to validate insights.

  • Present findings through comprehensive visual and tabular representations.

## Downloading necessary libraries
# Installing the libraries
if (!require(tidyverse)) install.packages("tidyverse") 
if (!require(dplyr)) install.packages("dplyr")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(naniar)) install.packages("naniar")
if (!require(mice)) install.packages("mice")
if (!require(modelsummary)) install.packages("modelsummary")
if (!require(reshape2)) install.packages("reshape2")
if (!require(editrules)) install.packages("editrules")
if (!require(dlookr)) install.packages("dlookr")
if (!require(editrules)) install.packages("editrules")
if (!require(VIM)) install.packages("VIM")
if (!require(deducorrect)) install.packages("deducorrect")
if (!require(ISLR)) install.packages("ISLR")
if (!require(outliers)) install.packages("outliers")
if (!require(ggstatsplot)) install.packages("ggstatsplot")
if (!require(scales)) install.packages("scales")
if (!require(corrplot)) install.packages("corrplot")
if (!require(gridExtra)) install.packages("grid.extra")
if (!require(kableExtra)) install.packages("kable.extra")
if (!require(plyr)) install.packages("plyr")
if (!require(summarytools)) install.packages("summarytools")
if (!require(classInt)) install.packages("classInt")
if (!require(psych)) install.packages("psych")
if (!require(here)) install.packages("here")
if (!require(tidyr)) install.packages("tidyr")
if (!require(plotly)) install.packages("plotly")
if (!require(egg)) install.packages("egg")

# Attaching the libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(naniar)
library(mice)
library(modelsummary)
library(reshape2)
library(editrules)
library(dlookr)
library(editrules)
library(VIM)
library(deducorrect)
library(ISLR)
library(outliers)
library(ggstatsplot)
library(RColorBrewer)
library(viridis)
library(rlang)
library(scales)
library(corrplot)
library(gridExtra)
library(kableExtra)
library(plyr)
library(summarytools)
library(classInt)
library(psych)
library(here)
library(tidyr)
library(plotly)
library(egg)

Data structure

The dataset contains apartment sales and rental offers from 15 largest cities in Poland (Warsaw, Lodz, Krakow, Wroclaw, Poznan, Gdansk, Szczecin, Bydgoszcz, Lublin, Katowice, Bialystok, Czestochowa). The data comes from local websites with apartments for sale. In order to fully capture the neighborhood of each apartment, each listing has been augmented with Open Street Map data with distances to points of interest (POIs). The data is collected monthly and covers the period from August 2023 to June 2024.

Files

apartments_pl_YYY_MM.csv - monthly snapshot of sales listings

apartments_rent_pl_YYYY_MM.csv - monthly snapshot of rental offers

Data fields

  • city - name of the city where the property is located
  • type - type of building
  • squareMeters - size of the apartment in square meters
  • rooms - number of rooms in the apartment
  • floor / floorCount - floor on which the apartment is located and the total number of floors in the building
  • buildYear - the year in which the building was built
  • latitude, longitude - geographical coordinates of the property
  • centerDistance - distance from the city center in km
  • poiCount - number of places of interest within a radius of 500m from the apartment (schools, clinics, post office, kindergartens, restaurants, universities, pharmacies)
  • Distance - distance to the nearest point of interest (schools, clinics, post office, etc.)

Important data manipulations conducted in this step are adding months of when the data come from to the dataset, and binding files. In the raw state of data, this date is only included in the name of the file, and also we have separate files for each month.

# Loading data
load_data <- function(name_struct) {
  files_names = 
    # Listing files in the proper location and with given pattern
    list.files(
    path = file.path("./analiza_danych_projekt_zespolowy/Nieruchomosci w Polsce"),
    pattern = name_struct, 
    full.names = TRUE)
  myfiles = lapply(files_names, read.csv)
  
  # adding column "month", labeling data from which month it was scrapped
  for (i in seq_along(myfiles)) {myfiles[[i]]$month = to_date(files_names[i])}
  
  # bind together dataframe
  return(do.call(rbind, myfiles))
}

# adding new column to each DF with month
to_date <- function(primary_date) {
  date_part <- sub(".*(\\d{4}_\\d{2}).*", "\\1", primary_date)
  complete_date <- paste0(date_part, "_01")
  
  parsed_date <- 
    as.Date(complete_date, format = "%Y_%m_%d")
  
  return(parsed_date)
}

data_buy = load_data("^apartments_pl_[0-9]+_[0-9]+\\.csv$")
data_rent = load_data("^apartments_rent_pl_[0-9]+_[0-9]+\\.csv$")

Data Cleansing and Wrangling

Data cleansing and wrangling are essential steps in the data analysis process, ensuring that raw data is transformed into a structured, accurate, and usable format. Data cleansing involves among others handling missing values, outliers and identifying and correcting errors. Data wrangling focuses on reshaping, merging, and transforming datasets to make them suitable for analysis.

The goal of these processes is to enhance data integrity, eliminate inconsistencies, and prepare the dataset for meaningful insights.

Removing unnecessary data

Row IDs are unique and therefore provide no differentiation or analytical value. Additionally, due to the nature of the data, duplicate values are possible. For example, if a property was listed for sale in January but sold in March, it would appear three times—once for each month it was listed.

Data summary

A brief exploration of the dataset reveals that both the rent and buy datasets share the same 28 features, which include both categorical and continuous variables. The buy dataset contains 92 967 unique records, while the rent dataset has 37 941 unique records. An interesting observation is that the data includes not only missing values (NAs) but also different type of blank entries, which will be addressed in the next section.

Checking NAs

knitr::kable(data.frame(miss_var_summary(data_buy)))
variable n_miss pct_miss
condition 69914 75.2
buildingMaterial 39179 42.1
type 19792 21.3
floor 15984 17.2
buildYear 15641 16.8
hasElevator 4448 4.78
collegeDistance 2492 2.68
floorCount 1082 1.16
clinicDistance 333 0.358
restaurantDistance 226 0.243
pharmacyDistance 128 0.138
postOfficeDistance 99 0.106
kindergartenDistance 85 0.0914
schoolDistance 60 0.0645
city 0 0
squareMeters 0 0
rooms 0 0
latitude 0 0
longitude 0 0
centreDistance 0 0
poiCount 0 0
ownership 0 0
hasParkingSpace 0 0
hasBalcony 0 0
hasSecurity 0 0
hasStorageRoom 0 0
price 0 0
month 0 0
knitr::kable(data.frame(miss_var_summary(data_rent)))
variable n_miss pct_miss
condition 27790 73.2
buildingMaterial 16157 42.6
buildYear 10570 27.9
type 9190 24.2
floor 4646 12.2
hasElevator 2096 5.52
floorCount 730 1.92
collegeDistance 490 1.29
restaurantDistance 82 0.216
pharmacyDistance 38 0.100
clinicDistance 32 0.0843
kindergartenDistance 28 0.0738
postOfficeDistance 19 0.0501
schoolDistance 11 0.0290
city 0 0
squareMeters 0 0
rooms 0 0
latitude 0 0
longitude 0 0
centreDistance 0 0
poiCount 0 0
ownership 0 0
hasParkingSpace 0 0
hasBalcony 0 0
hasSecurity 0 0
hasStorageRoom 0 0
price 0 0
month 0 0

Both data sets have some serious problems with missing data, especially in condition and building material.

Data set ‘buy’

# Upset plot of missing data and number of missings by a city - buy dataset
NA_vector_buy <- c("floor", "buildYear", "collegeDistance", "floorCount", 
               "clinicDistance", "restaurantDistance", "pharmacyDistance", 
               "postOfficeDistance", "kindergartenDistance", "schoolDistance",
               "condition", "buildingMaterial", "type", "hasElevator")

gg_miss_upset(data_buy, nsets=14, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
              set_size.numbers_size=TRUE, set_size.show = TRUE)

gg_miss_fct(data_buy[,c(NA_vector_buy, "city")], fct = city) +
  theme(text = element_text(size = 25)) 

An analysis of missing values in the ‘buy’ dataset using an upset plot reveals some connections between condition, building material, and type. However, given the scale of missing values in these features, their mutual occurrence does not appear to be significant. On the other hand, in Gdańsk and Gdynia, the ‘condition’ attribute is missing more frequently than in other cities, while in Częstochowa, the ‘build year’ is notably absent more often.

Data set ‘rent’

# Upset plot of missing data and number of missings by a city - rent dataset
NA_vector_rent <- c("condition", "floor", "buildYear", "floorCount", "collegeDistance",
               "clinicDistance", "restaurantDistance", "pharmacyDistance", 
               "postOfficeDistance", "kindergartenDistance", "schoolDistance")

gg_miss_upset(data_rent, nsets=10, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
              set_size.numbers_size=TRUE, set_size.show = TRUE)

gg_miss_fct(data_rent[,c(NA_vector_rent, "city")], fct = city) +
  theme(text = element_text(size = 25)) 

A similar pattern is observed in the ‘rent’ dataset. Once again, some connections between condition, building material, and type are visible, but they do not appear to be significant. In this case, missing condition data is noticeable not only in Gdańsk and Gdynia but also in Częstochowa, where the ‘build year’ attribute is also frequently missing.

In the next steps will be checked correlation between NAs and other variable, for this purpose we will conduct below steps:

  • code NAs to 1 (in case of NA) and 0

  • code binary, categorical variables

  • exclude non binar variables

# Coding NAs
NA_data_buy = data_buy
NA_data_rent = data_rent

for (i in 1:length(NA_vector_buy)) {
  col_name <- paste("NA",NA_vector_buy[i], sep="_")
  NA_data_buy[[col_name]] <- ifelse(is.na(NA_data_buy[[NA_vector_buy[i]]]), 1, 0)
}

for (i in 1:length(NA_vector_rent)) {
  col_name <- paste("NA",NA_vector_rent[i], sep="_")
  NA_data_rent[[col_name]] <- ifelse(is.na(NA_data_rent[[NA_vector_rent[i]]]), 1, 0)
}

# Coding categorical variables
NA_data_buy$condition <- 
  NA_data_buy$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1

NA_data_rent$condition <- 
  NA_data_rent$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1

yes_no_variables <- c("hasParkingSpace", "hasBalcony", "hasElevator", 
                      "hasSecurity", "hasStorageRoom")

for (i in 1:length(yes_no_variables)){
  NA_data_buy[[yes_no_variables[i]]] <- 
    NA_data_buy[[yes_no_variables[i]]] %>% 
    ordered(c("no", "yes")) %>% 
    as.numeric()-1
  
  NA_data_rent[[yes_no_variables[i]]] <- 
    NA_data_rent[[yes_no_variables[i]]] %>% 
    ordered(c("no", "yes")) %>% 
    as.numeric()-1
}

# vector of features to exclude
exclude_HM <- c("city", "type", "ownership", "buildingMaterial", "month", 
                "condition", "hasElevator")

# Heat maps

selected_features1 <- names(NA_data_buy)[startsWith(names(NA_data_buy), "NA_")]
selected_features2 <- names(NA_data_buy)[!startsWith(names(NA_data_buy), "NA_")]

NA_data_buy[,!names(NA_data_buy) %in% exclude_HM] %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits=2) %>%
  melt() %>%
  filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
  ggplot(aes(x=Var1, y=Var2, fill=value))  +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Correlation matrix with NA - 'buy' dataset") +
  scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0)+
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))  +
  theme(text = element_text(size = 25)) 

selected_features1 <- names(NA_data_rent)[startsWith(names(NA_data_rent), "NA_")]
selected_features2 <- names(NA_data_rent)[!startsWith(names(NA_data_rent), "NA_")]
  
NA_data_rent[, !names(NA_data_rent) %in% exclude_HM] %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits = 2) %>%
  melt() %>%
  filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Correlation matrix with NA - 'rent' dataset") +
  scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  theme(text = element_text(size = 25)) 

Correlation matrix doesn’t show any strong dependencies between missing and other variables

Outliers

For finding outliers we used couple methods:

  • Grubbs’ test

  • boxplots

  • Interqunatile Range Method - Q-Q plot

We paid special attention to price since this is the subject of our research. Grubbs’ test allows us defining features with potential outliers:

categorical_var <- c("city", "type", "ownership", "buildingMaterial", "condition", "hasParkingSpace", "hasBalcony", "hasElevator", "hasSecurity", "hasStorageRoom", "id", "month")

grubbs_test_buy <- 
  sapply(data_buy[,!names(data_buy) %in% categorical_var], grubbs.test)
var_buy <- cbind(grubbs_test_buy["p.value",][grubbs_test_buy["p.value",] < 0.05])
knitr::kable(var_buy, caption = "Variables with p-value lower than 5% (buy data set)", col.names = c("Variable", "p-value"))
Variables with p-value lower than 5% (buy data set)
Variable p-value
floor 0
floorCount 1.00849896211841e-07
poiCount 0
schoolDistance 0
postOfficeDistance 0
kindergartenDistance 0
restaurantDistance 0
pharmacyDistance 0
price 2.03623467800451e-05
grubbs_test_rent <- 
 sapply(data_rent[,!names(data_rent) %in% categorical_var], grubbs.test)
var_rent <- cbind(grubbs_test_rent["p.value",][grubbs_test_rent["p.value",] < 0.05])
knitr::kable(var_rent, caption = "Variables with p-value lower than 5% (rent data set)", col.names = c("Variable", "p-value"))
Variables with p-value lower than 5% (rent data set)
Variable p-value
floor 0
floorCount 2.41967365033346e-08
buildYear 0.0077240047029381
centreDistance 0.0210443044676376
poiCount 2.45829663292341e-08
schoolDistance 0
clinicDistance 0.000174023317614402
postOfficeDistance 0
kindergartenDistance 0
restaurantDistance 0
pharmacyDistance 0
price 0

The strongest outliers are suspected in floor, price, and proximity-related variables (schoolDistance, postOfficeDistance, etc.), suggesting unusually high or low values in these attributes. Outliers in rental prices could indicate luxury apartments or extremely cheap listings. Floor-related outliers might suggest penthouse apartments or ground-floor units with unusual characteristics. The rental data set has more variables with outlier candidates.

Outliers visualization for the ‘buy’ data set

# boxplots buy
bp1 <- ggplot(pivot_longer(data_buy[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Public destination distance outliers") + 
  labs(x="", y="Miles") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  theme(text = element_text(size = 25)) 

bp2 <- ggplot(pivot_longer(data_buy[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Floor and floor count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp3 <- ggplot(pivot_longer(data_buy[c("poiCount")], cols="poiCount"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("PoI count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp4 <- ggplot(pivot_longer(data_buy[c("price")], cols="price"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Price outliers") + 
  labs(x="", y="PLN") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)

Outliers visualization for the ‘rent’ data set

# boxplots rent
bp1 <- ggplot(pivot_longer(data_rent[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Public destination distance outliers") + 
  labs(x="", y="Miles") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  theme(text = element_text(size = 25)) 

bp2 <- ggplot(pivot_longer(data_rent[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Floor and floor count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp3 <- ggplot(pivot_longer(data_rent[c("poiCount")], cols="poiCount"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("PoI count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp4 <- ggplot(pivot_longer(data_rent[c("price")], cols="price"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Price outliers") + 
  labs(x="", y="PLN") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)

Some apartments are located in remote areas with poor access to these public facilities. Outliers in floor and floorCount variables indicate penthouses or very high-floor apartments in skyscrapers. High value of PoI suggests that some apartments are located in highly urbanized areas with dense amenities, while high prices indicate luxury apartments or highly premium properties. The boxplots exhibit a dense concentration of outliers, with no observations noticeably separated from the rest. Above mentioned suggests that the detected outliers are not due to erroneous data but rather a result of a fat-tailed distribution in the given features.

# Interqunatile Range Method for price outliers
buy_Q1 <- quantile(data_buy$price, 0.25, na.rm = TRUE)
buy_Q3 <- quantile(data_buy$price, 0.75, na.rm = TRUE)
buy_IQR_value <- buy_Q3 - buy_Q1
buy_lower_bound <- max(0, buy_Q1 - 1.5 * buy_IQR_value)  # Adjust lower bound to 0, because price cannot be negative
buy_upper_bound <- buy_Q3 + 1.5 * buy_IQR_value

rent_Q1 <- quantile(data_rent$price, 0.25, na.rm = TRUE)
rent_Q3 <- quantile(data_rent$price, 0.75, na.rm = TRUE)
rent_IQR_value <- rent_Q3 - rent_Q1
rent_lower_bound <- max(0, rent_Q1 - 1.5 * rent_IQR_value)  # Adjust lower bound to 0, because price cannot be negative
rent_upper_bound <- rent_Q3 + 1.5 * rent_IQR_value

# Identify price outliers 
buy_outliers <- 
  data_buy[data_buy$price < buy_lower_bound | data_buy$price > buy_upper_bound,]
rent_outliers <- 
  data_rent[data_rent$price < rent_lower_bound | data_rent$price > rent_upper_bound,]

# See how many values are outliers related to the original dataset
print(paste("Price outliers related to the original dataset - buy ", as.character(round((length(buy_outliers$price)/length(data_buy$price)) * 100, 2)), "%", sep=""))
## [1] "Price outliers related to the original dataset - buy 6.09%"
print(paste("Price outliers related to the original dataset - rent ", as.character(round(length(rent_outliers$price)/length(data_rent$price) * 100, 2)), "%", sep=""))
## [1] "Price outliers related to the original dataset - rent 7.57%"

Taking a closer look at the rent outliers

# Taking a closer look at the rent outliers
knitr::kable(data_rent[data_rent$price == min(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])
city price squareMeters poiCount
30685 katowice 346 30.36 4
knitr::kable(data_rent[data_rent$price == max(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])
city price squareMeters poiCount
32075 warszawa 23000 148.0 62
33121 warszawa 23000 147.6 60
33521 warszawa 23000 126.0 100

While 346 PLN for rent may seem low, there are multiple observations close to this price. There are 3 observations with price of 23 000 PLN, all located in the center of Warsaw, where there is a high density of points of interest (POI).

# Taking a closer look at the buy outlier
knitr::kable(data_buy[data_buy$price == min(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])
city price squareMeters poiCount
18202 bydgoszcz 150000 31.02 8
18206 bydgoszcz 150000 35.94 8
knitr::kable(data_buy[data_buy$price == max(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])
city price squareMeters poiCount
79943 warszawa 3250000 133.51 9
80351 warszawa 3250000 136.00 2
84104 warszawa 3250000 133.51 13
129837 warszawa 3250000 131.70 26

For apartments listed for sale, neither a price of 1,500,000 PLN nor 3,250,000 PLN appears alarming or unusual. There are multiple listings with prices close to these values, suggesting they are within a reasonable range.

An additional confirmation of the price distribution would be a Q-Q plot. Since apartment prices are typically log-normally distributed, a Q-Q plot of log-transformed rent prices would help assess their conformity to this distribution.

qqnorm(log(data_buy$price), main = "Normal Q-Q Plot ('buy' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_buy$price))

qqnorm(log(data_rent$price), main = "Normal Q-Q Plot ('rent' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_rent$price))

In our case distributions are fat-tailed.

Data validation

Data validation ensures data correctness by verifying logical consistency and identifying anomalies. In this data set, we checked the following properties:

  • Floor number is less than or equal to the total floor count.

  • Build year is less than 2024.

  • Latitude and longitude fall within Poland’s geographic range.

# Data validation
RULE <- editfile("RULES")
violated_buy <- violatedEdits(RULE, data_buy)
# summary(violated_buy)

RULE <- editfile("RULES")
violated_rent <- violatedEdits(RULE, data_rent)
# summary(violated_rent)

Data imputation

Since there are no incorrect data or outliers, we can proceed with imputing missing values using the MICE method, specifically the classification and regression trees (CART) variation. First, we encode text and categorical variables to make them suitable for imputation. Next, we perform the imputation, verify its success, and finally decode the variables back to their original format.

To avoid retraining the model each time, we will save the imputation results and disable the relevant cells from running in every Markdown execution.

# coding data - buy dataset
data_buy$condition <- 
  data_buy$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1 

data_buy$buildingMaterial <- 
  data_buy$buildingMaterial %>% 
  ordered(c("brick", "concreteSlab")) %>% 
  as.numeric()-1 

data_buy$type <- 
  data_buy$type %>% 
  ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>% 
  as.numeric()-1 

data_buy$hasElevator <- 
  data_buy$hasElevator %>% 
  ordered(c("no", "yes")) %>% 
  as.numeric()-1 
# MICE imputation
imp_buy <- mice(data_buy, m=3, maxit=3, method="cart")

# Pool results into final dataset
complete_data_buy <- complete(imp_buy)
# evaluating imputation results
data.frame(miss_var_summary(complete_data_buy))

Data was successfully imputed

# decoding variables
complete_data_buy$condition <- ifelse(complete_data_buy$condition == 0, "low", "premium")
complete_data_buy$buildingMaterial <- ifelse(complete_data_buy$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_buy$type <- ifelse(complete_data_buy$type == 0, "apartmentBuilding",
                        ifelse(complete_data_buy$type == 1, "blockOfFlats", "tenement"))
complete_data_buy$hasElevator <- ifelse(complete_data_buy$hasElevator == 0, "no", "yes")
# saving imputed data
write.csv(complete_data_buy,"clean_buy.csv")
# coding vriables - rent dataset
data_rent$condition <- 
  data_rent$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1 

data_rent$buildingMaterial <- 
  data_rent$buildingMaterial %>% 
  ordered(c("brick", "concreteSlab")) %>% 
  as.numeric()-1 

data_rent$type <- 
  data_rent$type %>% 
  ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>% 
  as.numeric()-1 

data_rent$hasElevator <- 
  data_rent$hasElevator %>% 
  ordered(c("no", "yes")) %>% 
  as.numeric()-1
# Multiple data imputation
imp_rent <- mice(data_rent, m=3, maxit=3, method="cart")

# Pool results into final dataset
complete_data_rent <- complete(imp_rent)
# evaluating imputation
data.frame(miss_var_summary(complete_data_rent))
# decoding variables
complete_data_rent$condition <- ifelse(complete_data_rent$condition == 0, "low", "premium")
complete_data_rent$buildingMaterial <- ifelse(complete_data_rent$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_rent$type <- ifelse(complete_data_rent$type == 0, "apartmentBuilding",
                        ifelse(complete_data_rent$type == 1, "blockOfFlats", "tenement"))
complete_data_rent$hasElevator <- ifelse(complete_data_rent$hasElevator == 0, "no", "yes")
# saving imputed data
write.csv(complete_data_rent,"clean_rent.csv")

Visualizations

total_median_buy <- median(complete_data_buy$price)

plot_median_buy <- complete_data_buy %>%
  group_by(city) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = reorder(city, median_price), 
             y = median_price,
             fill = median_price > total_median_buy)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median Purchase Price by City", 
       x = "City", 
       y = "Median Price") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


total_median_rent <- median(complete_data_rent$price)

plot_median_rent <- complete_data_rent %>%
  group_by(city) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = reorder(city, median_price), 
             y = median_price,
             fill = median_price > total_median_rent)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median Rent Price by City", 
       x = "City", 
       y = "Median Price") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(plot_median_buy, plot_median_rent, ncol = 2)

There are highest prices for housing purchasing in the largest Polish cities as it follows from the graph, where in red marked cities with median prices above overall median price. The highest prices are in Warsaw, Krakow and Gdansk, while the cheapest housings are in Czestochowa, Radom and Bydgoszcz. The rent prices are a little more close in terms of the median values, nevertheless there is an outlier - capital city Warsaw, where the median is over 30% higher than in other cities.

The impact of amenities on price

The images present scatter plots for different Polish cities, showing the relationship between apartment size (square meters) and price (PLN). The red line represents a fitted trend line for each city.

Purchase data set

ggplot(complete_data_buy, aes(x = squareMeters, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~city) +
  labs(title = "Appartemnt Price vs Size by City") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 
Warsaw, Gdańsk, Kraków, and Gdynia have steeper slopes, meaning that prices per square meter are higher. Radom, Bydgoszcz, and Częstochowa have flatter slopes, suggesting that price per square meter is lower in these cities. Some cities (e.g., Warsaw, Kraków, and Gdańsk) exhibit higher dispersion, meaning a wider range of apartment prices for similar sizes. Outliers are visible, particularly in expensive markets like Warsaw and Gdynia, where some apartments are priced significantly higher than the trend line suggests. Regional cities (e.g., Lublin, Radom, and Rzeszów) have much lower price ceilings, indicating lower overall demand and affordability.

Buy data set

ggplot(complete_data_rent, aes(x = squareMeters, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~city) +
  labs(title = "Appartemnt Price vs Size by City") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

Apartment size is a key determinant of price, but the rate at which price increases varies across cities. There is a clearly visible positive correlation between the size of an apartment and its price. The slope of the read line (smoothing) is the greatest for the cities with the most expensive apartments, so the result is correlated with conclusions made earlier.

Buy data set

complete_data_buy %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
  ggplot(aes(x = has_amenity, y = price)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~amenity) +
  labs(title = "Price Distribution by Amenities") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

Rental data set

complete_data_rent %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
  ggplot(aes(x = has_amenity, y = price)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~amenity) +
  labs(title = "Price Distribution by Amenities") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

The presence of amenities increases the price of housing, while there is an outlier amenity - storage room: the median price for housings with a storage room is lower than without it. The difference in price for the categories is plotted on the next graph.

pl1 <- complete_data_buy %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
  group_by(amenity) %>%
  mutate(
    avg_no = mean(price[has_amenity == "no"]),
    avg_yes = mean(price[has_amenity == "yes"]),
    pct_diff = (avg_yes - avg_no) / avg_no * 100
  ) %>%
  ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
  geom_col() +
  scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
  coord_flip() +
  labs(
    title = "Purchase data",
    x = "Amenity",
    y = "Price Difference (%)",
    fill = "Price Impact"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- complete_data_rent %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
  group_by(amenity) %>%
  mutate(
    avg_no = mean(price[has_amenity == "no"]),
    avg_yes = mean(price[has_amenity == "yes"]),
    pct_diff = (avg_yes - avg_no) / avg_no * 100
  ) %>%
  ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
  geom_col() +
  scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
  coord_flip() +
  labs(
    title = "Rental data",
    x = "Amenity",
    y = "Price Difference (%)",
    fill = "Price Impact"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price Premium for Amenities",gp=gpar(fontsize=35)))

Amenities (all but storageRoom) seem to have a noticeable positive impact on price of the apartment. A lower purchase price for housings with storage rooms may be caused by hidden additional costs, which means that buyers are forced to pay extra for a storage room. The phenomena requires further analysis with broader range of data.

Correlation exploration

Purchase data set

distance_vars_buy <- select(complete_data_buy, contains("Distance"), price)
correlation_matrix_buy <- cor(distance_vars_buy, use = "complete.obs")
corrplot(correlation_matrix_buy, method = "color", addCoef.col = 1,
         tl.cex = 1.8, number.cex = 1.8)

There is no strong correlation found between the level of prices and distance to public places. However there is pretty solid positive correlation between variables of distances to public places.

Rental data set

distance_vars_rent <- select(complete_data_rent, contains("Distance"), price)
correlation_matrix_rent <- cor(distance_vars_rent, use = "complete.obs")
corrplot(correlation_matrix_rent, method = "color", addCoef.col = 1,
         tl.cex = 1.8, number.cex = 1.8)

Similar trends are observed on a rent market data set.

plot_age_buy <- ggplot(complete_data_buy, aes(x = buildYear, y = price)) +
  geom_smooth() +
  labs(title = "Purchase data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

plot_age_rent <- ggplot(complete_data_rent, aes(x = buildYear, y = price)) +
  geom_smooth() +
  labs(title = "Rental data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(plot_age_buy+theme(legend.position="top"),
    plot_age_rent+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price vs Building Age",gp=gpar(fontsize=35)))

Interesting things can be observed – model is less certain about the smooth price for apartments for rent built in the 70s, which might suggest there’s less of them available for rent. This phenomena is correlated with modern market rules, as the demand on the new apartments is higher. The apartments in very old buildings tend to be very expensive, but there are not many of them. Commie blocks are the least expensive for both rent and purchase. Out of modern buildings those built around 2508 seem to be the most expensive. The prices for the newest buildings seem to be falling with each year.

pl1 <- ggplot(complete_data_buy, aes(x = condition, y = price)) +
  geom_violin(fill = "beige") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- ggplot(complete_data_rent, aes(x = condition, y = price)) +
  geom_violin(fill = "beige") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Rental data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price Distribution by Property Condition",gp=gpar(fontsize=35)))

There is no significant influence of housing condition on the price level, nevertheless its visible that premium housings are a little bit more expensive. It can be noticed that the figure of premium condition has a more curvy shape, which can be explained by larger number of observation.

Descriptive analysis

Summary statistics

data_buy <- complete_data_buy
data_rent <- complete_data_rent

data_buy <- data_buy %>% 
  mutate(price_per_m2 = price / squareMeters)
data_rent <- data_rent %>% 
  mutate(price_per_m2 = price / squareMeters)

print(dfSummary(data_buy, na.col=FALSE, headings=FALSE, valid.col=FALSE),
      method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph
1 city [character]
1. warszawa
2. krakow
3. wroclaw
4. gdansk
5. lodz
6. poznan
7. bydgoszcz
8. gdynia
9. szczecin
10. katowice
[ 5 others ]
31982(34.4%)
14186(15.3%)
9598(10.3%)
8244(8.9%)
7073(7.6%)
3639(3.9%)
3627(3.9%)
3569(3.8%)
2487(2.7%)
2397(2.6%)
6165(6.6%)
2 type [character]
1. apartmentBuilding
2. blockOfFlats
3. tenement
19050(20.5%)
59487(64.0%)
14430(15.5%)
3 squareMeters [numeric]
Mean (sd) : 56.5 (20.3)
min ≤ med ≤ max:
25 ≤ 52.6 ≤ 150
IQR (CV) : 22.8 (0.4)
7173 distinct values
4 rooms [integer]
Mean (sd) : 2.6 (0.9)
min ≤ med ≤ max:
1 ≤ 2 ≤ 6
IQR (CV) : 1 (0.3)
1:5960(6.4%)
2:40684(43.8%)
3:33745(36.3%)
4:10150(10.9%)
5:1853(2.0%)
6:575(0.6%)
5 floor [integer]
Mean (sd) : 3.3 (2.5)
min ≤ med ≤ max:
1 ≤ 3 ≤ 29
IQR (CV) : 2 (0.8)
27 distinct values
6 floorCount [integer]
Mean (sd) : 5.5 (3.3)
min ≤ med ≤ max:
1 ≤ 4 ≤ 29
IQR (CV) : 4 (0.6)
29 distinct values
7 buildYear [integer]
Mean (sd) : 1985.1 (32.5)
min ≤ med ≤ max:
1850 ≤ 1988 ≤ 2024
IQR (CV) : 46 (0)
165 distinct values
8 latitude [numeric]
Mean (sd) : 52 (1.3)
min ≤ med ≤ max:
50 ≤ 52.2 ≤ 54.6
IQR (CV) : 1.2 (0)
46449 distinct values
9 longitude [numeric]
Mean (sd) : 19.6 (1.7)
min ≤ med ≤ max:
14.4 ≤ 19.9 ≤ 23.2
IQR (CV) : 2.4 (0.1)
48233 distinct values
10 centreDistance [numeric]
Mean (sd) : 4.6 (2.8)
min ≤ med ≤ max:
0 ≤ 4.3 ≤ 16.9
IQR (CV) : 4.1 (0.6)
1487 distinct values
11 poiCount [integer]
Mean (sd) : 19.9 (22.8)
min ≤ med ≤ max:
0 ≤ 14 ≤ 212
IQR (CV) : 17 (1.1)
196 distinct values
12 schoolDistance [numeric]
Mean (sd) : 0.4 (0.5)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 4.9
IQR (CV) : 0.3 (1.1)
2505 distinct values
13 clinicDistance [numeric]
Mean (sd) : 1 (0.9)
min ≤ med ≤ max:
0 ≤ 0.7 ≤ 5
IQR (CV) : 0.9 (0.9)
4263 distinct values
14 postOfficeDistance [numeric]
Mean (sd) : 0.5 (0.5)
min ≤ med ≤ max:
0 ≤ 0.4 ≤ 5
IQR (CV) : 0.4 (0.9)
2689 distinct values
15 kindergartenDistance [numeric]
Mean (sd) : 0.4 (0.4)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 5
IQR (CV) : 0.3 (1.2)
2309 distinct values
16 restaurantDistance [numeric]
Mean (sd) : 0.3 (0.4)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 5
IQR (CV) : 0.3 (1.3)
2406 distinct values
17 collegeDistance [numeric]
Mean (sd) : 1.6 (1.2)
min ≤ med ≤ max:
0 ≤ 1.2 ≤ 5
IQR (CV) : 1.6 (0.8)
4817 distinct values
18 pharmacyDistance [numeric]
Mean (sd) : 0.4 (0.4)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 5
IQR (CV) : 0.3 (1.3)
2420 distinct values
19 ownership [character]
1. condominium
2. cooperative
3. udział
83368(89.7%)
9592(10.3%)
7(0.0%)
20 buildingMaterial [character]
1. brick
2. concreteSlab
67529(72.6%)
25438(27.4%)
21 condition [character]
1. low
2. premium
38204(41.1%)
54763(58.9%)
22 hasParkingSpace [character]
1. no
2. yes
67949(73.1%)
25018(26.9%)
23 hasBalcony [character]
1. no
2. yes
38958(41.9%)
54009(58.1%)
24 hasElevator [character]
1. no
2. yes
46645(50.2%)
46322(49.8%)
25 hasSecurity [character]
1. no
2. yes
82428(88.7%)
10539(11.3%)
26 hasStorageRoom [character]
1. no
2. yes
51899(55.8%)
41068(44.2%)
27 price [integer]
Mean (sd) : 770307.8 (397438.1)
min ≤ med ≤ max:
150000 ≤ 689000 ≤ 3250000
IQR (CV) : 379000 (0.5)
7287 distinct values
28 month [character]
1. 2023-08-01
2. 2023-09-01
3. 2023-10-01
4. 2023-11-01
5. 2023-12-01
6. 2024-01-01
7. 2024-02-01
8. 2024-03-01
9. 2024-04-01
10. 2024-05-01
11. 2024-06-01
18905(20.3%)
6098(6.6%)
7024(7.6%)
7001(7.5%)
7865(8.5%)
5020(5.4%)
7130(7.7%)
7331(7.9%)
8917(9.6%)
8634(9.3%)
9042(9.7%)
29 price_per_m2 [numeric]
Mean (sd) : 13803.1 (4926.1)
min ≤ med ≤ max:
3000 ≤ 13323 ≤ 32096.8
IQR (CV) : 6856.9 (0.4)
47186 distinct values

Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10

Summary table contains information about every column in the purchase data set. It can be noticed from the table that

  1. Every city occupies different part of the country market;
  2. The most common building type is block of flats, while apartment buildings and tenements occupy ~25% of the market each;
  3. Average housing area (m²) is 58.7, while this variable has right-skewed distribution;
  4. The most widespread are 2 and 3 room apartments, whereas other categories make up under 25% of the total;
  5. The histogram of floor values is falling, which is expected behavior;
  6. There are apartments in buildings build from 1850 to 2024. Average trend is growing, but there are some drops in certain periods;
  7. Distance variables and POI count seem to have right skewed distribution, as public facilities are erected in the most populated areas;
  8. Almost 90% of apartments are owned on condominium agreement.
print(dfSummary(data_rent, na.col=FALSE, headings=FALSE, valid.col=FALSE),
      method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph
1 city [character]
1. warszawa
2. krakow
3. wroclaw
4. lodz
5. poznan
6. gdansk
7. katowice
8. gdynia
9. bydgoszcz
10. szczecin
[ 5 others ]
14785(39.0%)
7366(19.4%)
3699(9.7%)
2647(7.0%)
2217(5.8%)
1930(5.1%)
1453(3.8%)
769(2.0%)
731(1.9%)
719(1.9%)
1625(4.3%)
2 type [character]
1. apartmentBuilding
2. blockOfFlats
3. tenement
16188(42.7%)
17006(44.8%)
4747(12.5%)
3 squareMeters [numeric]
Mean (sd) : 53.4 (21.2)
min ≤ med ≤ max:
25 ≤ 48.5 ≤ 150
IQR (CV) : 22 (0.4)
3260 distinct values
4 rooms [integer]
Mean (sd) : 2.3 (0.8)
min ≤ med ≤ max:
1 ≤ 2 ≤ 6
IQR (CV) : 1 (0.4)
1:3874(10.2%)
2:21664(57.1%)
3:9416(24.8%)
4:2271(6.0%)
5:504(1.3%)
6:212(0.6%)
5 floor [integer]
Mean (sd) : 3.3 (2.6)
min ≤ med ≤ max:
1 ≤ 3 ≤ 30
IQR (CV) : 2 (0.8)
27 distinct values
6 floorCount [integer]
Mean (sd) : 5.7 (3.4)
min ≤ med ≤ max:
1 ≤ 5 ≤ 30
IQR (CV) : 3 (0.6)
30 distinct values
7 buildYear [integer]
Mean (sd) : 1995.8 (31.7)
min ≤ med ≤ max:
1850 ≤ 2009 ≤ 2024
IQR (CV) : 41 (0)
148 distinct values
8 latitude [numeric]
Mean (sd) : 51.7 (1.2)
min ≤ med ≤ max:
50 ≤ 52.2 ≤ 54.6
IQR (CV) : 1.2 (0)
22850 distinct values
9 longitude [numeric]
Mean (sd) : 19.7 (1.7)
min ≤ med ≤ max:
14.4 ≤ 20 ≤ 23.2
IQR (CV) : 2.4 (0.1)
23409 distinct values
10 centreDistance [numeric]
Mean (sd) : 3.9 (2.6)
min ≤ med ≤ max:
0 ≤ 3.5 ≤ 16.6
IQR (CV) : 3.7 (0.7)
1302 distinct values
11 poiCount [integer]
Mean (sd) : 24.2 (26.2)
min ≤ med ≤ max:
0 ≤ 16 ≤ 210
IQR (CV) : 21 (1.1)
188 distinct values
12 schoolDistance [numeric]
Mean (sd) : 0.4 (0.3)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 4.9
IQR (CV) : 0.3 (0.9)
1667 distinct values
13 clinicDistance [numeric]
Mean (sd) : 0.8 (0.7)
min ≤ med ≤ max:
0 ≤ 0.6 ≤ 5
IQR (CV) : 0.7 (0.9)
3162 distinct values
14 postOfficeDistance [numeric]
Mean (sd) : 0.5 (0.4)
min ≤ med ≤ max:
0 ≤ 0.4 ≤ 4.9
IQR (CV) : 0.4 (0.8)
1878 distinct values
15 kindergartenDistance [numeric]
Mean (sd) : 0.3 (0.3)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 4.8
IQR (CV) : 0.2 (1)
1464 distinct values
16 restaurantDistance [numeric]
Mean (sd) : 0.3 (0.3)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 5
IQR (CV) : 0.2 (1.2)
1455 distinct values
17 collegeDistance [numeric]
Mean (sd) : 1.4 (1.1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 5
IQR (CV) : 1.4 (0.8)
4153 distinct values
18 pharmacyDistance [numeric]
Mean (sd) : 0.3 (0.3)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 5
IQR (CV) : 0.3 (1.1)
1489 distinct values
19 ownership [character] 1. condominium
37941(100.0%)
20 buildingMaterial [character]
1. brick
2. concreteSlab
33271(87.7%)
4670(12.3%)
21 condition [character]
1. low
2. premium
865(2.3%)
37076(97.7%)
22 hasParkingSpace [character]
1. no
2. yes
23159(61.0%)
14782(39.0%)
23 hasBalcony [character]
1. no
2. yes
14505(38.2%)
23436(61.8%)
24 hasElevator [character]
1. no
2. yes
13664(36.0%)
24277(64.0%)
25 hasSecurity [character]
1. no
2. yes
32234(85.0%)
5707(15.0%)
26 hasStorageRoom [character]
1. no
2. yes
31784(83.8%)
6157(16.2%)
27 price [integer]
Mean (sd) : 3682.3 (2186)
min ≤ med ≤ max:
346 ≤ 3000 ≤ 23000
IQR (CV) : 1750 (0.6)
996 distinct values
28 month [character]
1. 2023-11-01
2. 2023-12-01
3. 2024-01-01
4. 2024-02-01
5. 2024-03-01
6. 2024-04-01
7. 2024-05-01
8. 2024-06-01
8533(22.5%)
2790(7.4%)
1865(4.9%)
5278(13.9%)
5160(13.6%)
4953(13.1%)
4641(12.2%)
4721(12.4%)
29 price_per_m2 [numeric]
Mean (sd) : 68.7 (22.6)
min ≤ med ≤ max:
11.3 ≤ 65.7 ≤ 189.5
IQR (CV) : 27.5 (0.3)
9600 distinct values

Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10

Rental market data in general follow the same trend, however there are some differences:

  1. Apartment buildings are the most common type of building the housing is located in;
  2. Area of the apartments is a bit lower (average 53.4 m²) than in case of purchase market;
  3. Two room apartments are the most represented category with over 57% of items;
  4. The average of build year is higher for rental market (by 12 years), which is explained by higher attractiveness of modern buildings for tenants;
  5. 97.7% of the apartments is declared to be in premium condition.

Share of market by city with contingency tables

table <- table(data_buy$city) %>% 
  as.data.frame() 

table$Share <- round(prop.table(table$Freq), 4)*100 

arrange(table, desc(Share)) %>% 
  kable(col.names=c("City", "Quantity", "Share, %")) %>% 
  kable_styling()
City Quantity Share, %
warszawa 31982 34.40
krakow 14186 15.26
wroclaw 9598 10.32
gdansk 8244 8.87
lodz 7073 7.61
poznan 3639 3.91
bydgoszcz 3627 3.90
gdynia 3569 3.84
szczecin 2487 2.68
katowice 2397 2.58
lublin 2299 2.47
czestochowa 1242 1.34
bialystok 1017 1.09
radom 845 0.91
rzeszow 762 0.82

It follows from the data that the largest share of the apartments market is in Warsaw (above 30% of the country market), smaller parts of the market represent next major cities: Kraków (15.26%), Wrocaław (10.32%) and Gdańsk (8.87%).

table <- table(data_rent$city) %>% 
  as.data.frame() 

table$Share <- round(prop.table(table$Freq), 4)*100 

arrange(table, desc(Share)) %>% 
  kable(col.names=c("City", "Quantity", "Share, %")) %>% 
  kable_styling()
City Quantity Share, %
warszawa 14785 38.97
krakow 7366 19.41
wroclaw 3699 9.75
lodz 2647 6.98
poznan 2217 5.84
gdansk 1930 5.09
katowice 1453 3.83
gdynia 769 2.03
bydgoszcz 731 1.93
szczecin 719 1.90
lublin 632 1.67
czestochowa 345 0.91
rzeszow 310 0.82
bialystok 223 0.59
radom 115 0.30

Warsaw has even larger part of the rent market almost 40%, growth also noticeable in the market share of Krakow (+4%). Remaining cities have minor shares of this market (under 10% each).

Price distribution shape with tabular data

# range(data_buy$price)

limits <- seq(min(data_buy$price)-500000, max(data_buy$price)+500000, by=100000)

limits_factor <- cut(data_buy$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))

hist(data_buy$price, prob=TRUE, breaks=limits, main="Price groupping for purchase market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_buy$price), col=2)

print("TAI (tabular accuracy index)")
## [1] "TAI (tabular accuracy index)"
tab_accuracy <- classIntervals(data_buy$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)
##        # classes  Goodness of fit Tabular accuracy 
##       41.0000000        0.9947621        0.9148005
pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)

The price has skewed distribution distribution. A histogram was build to show how many housings fall in certain price levels from minimum price 150000 up to the maximum 3250000 by 100000 PLN. The histogram shows that most represented category is 600-700 thousands PLN. The partition accuracy was checked with the tabular accuracy index, which shows high tabular accuracy and goodness of fit.

# range(data_rent$price)
limits <- seq(min(data_rent$price) - 300, max(data_rent$price) + 300, by=500)

limits_factor <- cut(data_rent$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))

hist(data_rent$price, prob=TRUE, breaks=limits, main="Price groupping for rental market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_rent$price), col=2)

print("TAI (tabular accuracy index)")
## [1] "TAI (tabular accuracy index)"
tab_accuracy <- classIntervals(data_rent$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)
##        # classes  Goodness of fit Tabular accuracy 
##       46.0000000        0.9959035        0.9154236
pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)

The shapes of the rental market distribution is narrower than purchase market one, which can be explained by the target groups of these offers (often these group cannot afford to buy a housing).

pl1 <- data_buy %>%
  ggplot(aes(x=price_per_m2)) +
  geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
  geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
  annotate("text", x=12100, y=0.00004, label=paste("Median: ", as.character(median(round(data_buy$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
  ggtitle("Purchase market") + 
  labs(x="PLN per square meter", y="Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- data_rent %>%
  ggplot(aes(x=price_per_m2)) +
  geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
  geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
  annotate("text", x=57, y=0.01, label=paste("Median: ", as.character(median(round(data_rent$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
  ggtitle("Rental market") + 
  labs(x="PLN per square meter", y="Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price per square meter",gp=gpar(fontsize=35)))

This image shows density plots for the price per square meter in both the purchase and rental markets. The distribution graphs of prices per square meter are less skewed than normal prices and additionally:

  1. Purchase Market

    • The distribution appears right-skewed (positively skewed), meaning more properties have prices below the median, with some high-priced outliers.

    • The median price per square meter is 13,322.98 PLN.

    • There’s a peak around 10,000-15,000 PLN, suggesting a common price range.

  2. Rental Market

    • Also right-skewed, but with a sharper peak and a steeper decline.

    • The median rental price per square meter is 65.67 PLN.

    • Most rentals are concentrated in the 40-80 PLN per square meter range.

Housing type percentage in different cities

type_across_cities <- table(data_buy$city, data_buy$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>% 
  kable() %>% 
  kable_styling()
apartmentBuilding blockOfFlats tenement Sum
bialystok 11.50 85.25 3.24 99.99
bydgoszcz 8.77 65.95 25.28 100.00
czestochowa 7.33 85.35 7.33 100.01
gdansk 19.30 64.85 15.85 100.00
gdynia 20.62 67.39 11.99 100.00
katowice 19.40 56.86 23.74 100.00
krakow 20.05 65.68 14.27 100.00
lodz 15.07 69.70 15.23 100.00
lublin 3.48 87.12 9.40 100.00
poznan 14.87 64.91 20.23 100.01
radom 3.08 89.23 7.69 100.00
rzeszow 12.73 82.94 4.33 100.00
szczecin 8.04 62.08 29.88 100.00
warszawa 26.58 60.10 13.32 100.00
wroclaw 24.76 55.13 20.12 100.01

The table shows percentage of housing of the certain type per city. There are cities with higher percentage of apartment buildings (Gdansk, Gdynia, Krakow, Warszawa and Wroclaw), which are characterized by greater market share and higher prices, while in smaller cities with lower prices relative quantity of offers in blocks of flats is higher (Bialystok, Czestohchowa, Lublina, Radom). Nonetheless there is a common trend for each of the cities: the majority of housings is offered in blocks of flats, while apartments of buildings and tenements make up a smaller group of housings.

type_across_cities <- table(data_rent$city, data_rent$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>% 
  kable() %>% 
  kable_styling()
apartmentBuilding blockOfFlats tenement Sum
bialystok 26.01 70.40 3.59 100.00
bydgoszcz 21.07 57.73 21.20 100.00
czestochowa 16.52 72.17 11.30 99.99
gdansk 38.86 51.71 9.43 100.00
gdynia 26.92 61.51 11.57 100.00
katowice 39.99 48.80 11.22 100.01
krakow 41.35 46.17 12.48 100.00
lodz 45.86 45.67 8.46 99.99
lublin 19.30 67.25 13.45 100.00
poznan 31.75 56.52 11.73 100.00
radom 20.00 72.17 7.83 100.00
rzeszow 28.71 66.13 5.16 100.00
szczecin 27.96 59.67 12.38 100.01
warszawa 49.34 36.02 14.64 100.00
wroclaw 45.61 45.07 9.33 100.01

The situation is somewhat different for the rental market. There is an increase in apartment buildings type offers resulting in an incline in percentage of block of flats. Quantity of offers in tenements is also reduced insignificantly.

Descriptive characteristics of price per m² in purchase data set

type mean sd median min max skewness kurtosis
apartmentBuilding 16835.35 4883.377 16614.36 4137.931 32096.77 0.3299874 2.672051
blockOfFlats 12820.99 4233.305 12531.38 3435.864 30935.48 0.5797487 3.329609
tenement 13848.69 5904.646 12983.91 3000.000 31041.67 0.4951448 2.487640

The most expensive in terms of price per square meter are housings in apartment buildings. Their average price is 26.8% more for blocks of flats and 25.1% for tenements in purchasing. The smallest standard deviation** value for blocks of flats in both markets can be explained by average characteristics of this type of housing: locations and conditions are close in different blocks of flats or apartments, while there are old, cheaper tenements and relatively small amount of the new ones in perfect condition and location so the deviation is significantly larger. Positive values of skewness indicate the tail of the distribution is on the right, which means the amount of more expensive apartments is fading more gradually than the cheaper ones. Kurtosis indicates quite peaked distributions for each category.

Descriptive characteristics of price per m² in rental data set

type mean sd median min max skewness kurtosis
apartmentBuilding 75.76773 23.22751 72.37607 11.30396 187.8378 1.1240927 5.142059
blockOfFlats 61.56362 19.15648 59.25926 11.29861 189.4737 0.9502302 5.420271
tenement 70.01260 24.04812 67.30769 11.29971 189.3146 0.7050550 3.895923

Apartments in apartment buildings have the highest average rent (75.77 PLN/m²) and median (72.38 PLN/m²), suggesting they tend to be more expensive. Block of flats has the lowest average (61.56 PLN/m²) and median (59.26 PLN/m²), meaning they are generally more affordable, while tenements are in between as it was for purchasing data set. Standard deviation is highest for tenements (24.05 PLN/m²) and apartment buildings (23.23 PLN/m²), indicating a wider spread in rental prices. The maximum rental price is quite high for all categories (~189 PLN/m²), suggesting some luxury or premium properties significantly impact the upper range. Apartment buildings (5.14) and block of flats (5.42) have higher kurtosis, indicating a sharper peak with more extreme values. Tenements (3.90) have a more moderate kurtosis, meaning a more balanced distribution without extreme outliers.

mu_buy <- ddply(data_buy, "type", summarise, grp.mean=mean(price_per_m2))

pl1 <- ggplot(data_buy, aes(x=price_per_m2, color=type)) +
  geom_density(size=1.5) +
  theme(legend.position="top") +
  geom_vline(data=mu_buy, aes(xintercept=grp.mean, color=type), linetype="dashed") +    
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) +
  theme_classic() +
  labs(x="PLN per square meter", y="Density", subtitle  = "Purchase market") +
  theme() +
  theme(text = element_text(size = 25)) 

mu_rent <- ddply(data_rent, "type", summarise, grp.mean=mean(price_per_m2))

pl2 <- ggplot(data_rent, aes(x=price_per_m2, color=type)) +
  geom_density(size=1.5) +
  geom_vline(data=mu_rent, aes(xintercept=grp.mean, color=type), linetype="dashed") +   
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) + 
  theme_classic() +
  labs(x="PLN per square meter", y="Density", subtitle  = "Rental market") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Distribution of price per square meter by housing type",gp=gpar(fontsize=35))
)

This image presents density plots for the distribution of price per square meter in both the purchase and rental markets, categorized by housing type (apartment buildings, blocks of flats and tenements). Purchase prices are more spread out than rental prices, particularly for apartment buildings. Rental prices are relatively more concentrated, with fewer extreme values. Blocks of flats consistently show the lowest prices and least variation, making them the most affordable option. Apartment buildings are the most expensive, both in purchase and rental markets, with more high-priced outliers.

pl1 <- ggplot(data_buy, aes(x=type,  fill=buildingMaterial))+
  geom_bar(stat="count", width=0.95) + 
  scale_fill_manual(values=c("darkred", "grey")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <-ggplot(data_rent, aes(x=type,  fill=buildingMaterial))+
  geom_bar(stat="count", width=0.95) + 
  scale_fill_manual(values=c("darkred", "grey")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Housing type vs building material",gp=gpar(fontsize=35))
)

The majority of housings is build of brick, while this material is used more frequently in apartment buildings and tenements than in blocks of flats, which are made of concrete slab in one third of cases. The trend is common for purchasing and rental markets.

Statistical tests

Statistical tests are conducted to determine whether observed data patterns are due to chance or represent real effects. They help in making objective decisions, validating hypotheses, and drawing conclusions in research by measuring relationships, differences, or associations within data sets.

Statistical significance of price median difference between cities

The main purpose of the statistical test we are going to perform is to determine which variables have a statistically significant impact on both rent and buy prices.

palette_colors <- viridis(length(unique(complete_data_buy$city)))

complete_data_buy %>% 
  ggbetweenstats(
    y = price,
    x = city,
    bf.message = FALSE,
    pairwise.display = "none",
    title = "Distribution of buy price across cities - buy dataset",
    palette = "default_ucscgb",
    package = "ggsci",
    centrality.label.args = list(size  = 7)
  ) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) 

complete_data_rent %>% 
  ggbetweenstats(
    y=price,
    x=city,
    bf.message = FALSE,
    pairwise.display = "none",
    title = "Distribution of rent price across cities - rent dataset",
    palette = "default_ucscgb",
    package = "ggsci",
    centrality.label.args = list(size  = 7)
  ) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) 

Based on Welch’s test, we obtained a p-value close to 0, so we reject the null hypothesis of equal means between populations in both the buy and rent data sets. This means that the city statistically significantly differentiates both rent and buy prices. However, an effect size of 0.86 implies that this differentiation has a moderate magnitude.

Statistical significance differentiation of price by amenities

amentities <- c("hasBalcony", "hasElevator", "hasParkingSpace", "hasStorageRoom", "hasSecurity")
lst <- list()
for (i in amentities) {
  lst[[i]] <-  
      ggbetweenstats(complete_data_buy,
        y = price,
        x = !!sym(i),
        bf.message = FALSE,
        palette = "default_ucscgb",
        package = "ggsci",
        title = i,
        centrality.label.args = list(size  = 7),
        ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
      ) +
      theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  +
      xlab("")
}

grid.arrange(grobs=lst,
    ncol = 3,
    nraw = 2,
    top =textGrob("Buy Price Distribution by amenities",gp=gpar(fontsize=35))
)

All amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.14 for a balcony to -0.57 for an elevator.

for (i in amentities) {
  lst[[i]] <-  
      ggbetweenstats(complete_data_buy,
        y = price,
        x = !!sym(i),
        bf.message = FALSE,
        palette = "default_ucscgb",
        package = "ggsci",
        title = i,
        centrality.label.args = list(size  = 7),
        ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
      ) +
      theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  +
      xlab("")
}

grid.arrange(grobs=lst,
    ncol = 3,
    nraw = 2,
    top =textGrob("Rent Price Distribution by amenities",gp=gpar(fontsize=35))
)

Similarly to buy prices, all amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.10 for a balcony to -0.31 for an elevator.

pl1 <-  ggbetweenstats(complete_data_buy,
    y = price,
    x = condition,
    bf.message = FALSE,
    palette = "default_ucscgb",
    package = "ggsci",
    title = "Buy data set",
    centrality.label.args = list(size  = 10),
    ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  

pl2 <-  ggbetweenstats(complete_data_rent,
    y = price,
    x = condition,
    bf.message = FALSE,
    palette = "default_ucscgb",
    package = "ggsci",
    title = "Rental data set",
    centrality.label.args = list(size  = 10),
    ggplot.component = list(theme(plot.subtitle = element_text(size = 12, face = "bold")))
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) +
  ylab("")

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price difference by condition",gp=gpar(fontsize=35))
)

Finally, condition of flat in both rent and buy data sets also seem to differentiate price significantly but has only moderate magnitude.

Conclusions

This project provides a comprehensive analysis of the real estate market, utilizing statistical methods and visualizations to uncover key trends and insights. The study examines data structure, performs data cleansing, and addresses outliers to ensure accuracy. Through descriptive analysis and summary statistics, the project identifies variations in market share across cities and explores price distribution patterns.

The following problems were explored among others:

Findings and insights

Certain cities (e.g., Gdańsk, Gdynia, Częstochowa) show significant missing data in condition and build year attributes, however there is no strong correlation between missing values and other features. The highest property prices are in Warsaw, Krakow, and Gdansk, while the cheapest cities are Czestochowa, Radom, and Bydgoszcz. Property prices follow a fat-tailed distribution, while high-priced properties are concentrated in city centers with dense places of interest (PoI). No strong correlation is found between price levels and proximity to public places, but distances between amenities show a strong relationship.

Rental and purchase price distributions by cities, amenities and condition differ significantly, as confirmed by Welch’s test (p-value ≈ 0). Rental prices are more uniform, except in Warsaw, where the median rent is 30% higher than in other cities. A positive correlation exists between apartment size and price, with larger price increases in cities with higher property costs. Amenities generally increase housing prices, except for storage rooms, which correlate with lower prices, but the effect size is small (e.g., -0.14 for balconies, -0.57 for elevators). Housing condition has a moderate impact on prices, but its effect size is limited.

The study confirms that city location, property size, and amenities significantly influence real estate prices. While rental prices show more uniformity, purchase prices exhibit stronger variations based on property type and city. Warsaw dominates both markets, with modern buildings losing value over time. Statistical tests validate these findings, providing insights for investors, buyers, and policymakers.