Introduction

The real estate market plays a crucial role in economic growth, urban development, and investment strategies. Understanding its structure, pricing trends, and key influencing factors is essential for stakeholders, including buyers, investors, and policymakers. This project provides an in-depth analysis of the real estate market using statistical methods and data visualization techniques to extract meaningful insights from the available data.

The following milestones were outlined during design process:

Explatory Analysis: Understanding the format, types, and sources of data.
Data Cleansing and Wrangling: Ensuring data accuracy, handling missing values, and standardizing formats.
Outliers: Identifying and analyzing unusual price points that may affect overall trends.
Visualizations: Graphical representations to facilitate pattern recognition and trend analysis.
Descriptive Analysis: Exploring key metrics like mean, median, variance, and skewness, recovered aggregated statistical insights that describe the market dynamics.
Price Distribution Shape: Analyzing the distribution of prices through density plots and statistical summaries.
Statistical Tests: Assessing the statistical significance of median price differences across cities and investigating the impact of amenities on price variations.

Objectives

Provide a clear understanding of real estate price structures.
Identify significant price differentiators and market trends.
Utilize statistical techniques to validate insights.
Present findings through comprehensive visual and tabular representations.

## Downloading necessary libraries
# Installing the libraries
if (!require(tidyverse)) install.packages("tidyverse") 
if (!require(dplyr)) install.packages("dplyr")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(naniar)) install.packages("naniar")
if (!require(mice)) install.packages("mice")
if (!require(modelsummary)) install.packages("modelsummary")
if (!require(reshape2)) install.packages("reshape2")
if (!require(editrules)) install.packages("editrules")
if (!require(dlookr)) install.packages("dlookr")
if (!require(editrules)) install.packages("editrules")
if (!require(VIM)) install.packages("VIM")
if (!require(deducorrect)) install.packages("deducorrect")
if (!require(ISLR)) install.packages("ISLR")
if (!require(outliers)) install.packages("outliers")
if (!require(ggstatsplot)) install.packages("ggstatsplot")
if (!require(scales)) install.packages("scales")
if (!require(corrplot)) install.packages("corrplot")
if (!require(gridExtra)) install.packages("grid.extra")
if (!require(kableExtra)) install.packages("kable.extra")
if (!require(plyr)) install.packages("plyr")
if (!require(summarytools)) install.packages("summarytools")
if (!require(classInt)) install.packages("classInt")
if (!require(psych)) install.packages("psych")
if (!require(here)) install.packages("here")
if (!require(tidyr)) install.packages("tidyr")
if (!require(plotly)) install.packages("plotly")
if (!require(egg)) install.packages("egg")

# Attaching the libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(naniar)
library(mice)
library(modelsummary)
library(reshape2)
library(editrules)
library(dlookr)
library(editrules)
library(VIM)
library(deducorrect)
library(ISLR)
library(outliers)
library(ggstatsplot)
library(RColorBrewer)
library(viridis)
library(rlang)
library(scales)
library(corrplot)
library(gridExtra)
library(kableExtra)
library(plyr)
library(summarytools)
library(classInt)
library(psych)
library(here)
library(tidyr)
library(plotly)
library(egg)

Data structure

The dataset contains apartment sales and rental offers from 15 largest cities in Poland (Warsaw, Lodz, Krakow, Wroclaw, Poznan, Gdansk, Szczecin, Bydgoszcz, Lublin, Katowice, Bialystok, Czestochowa). The data comes from local websites with apartments for sale. In order to fully capture the neighborhood of each apartment, each listing has been augmented with Open Street Map data with distances to points of interest (POIs). The data is collected monthly and covers the period from August 2023 to June 2024.

Files

apartments_pl_YYY_MM.csv - monthly snapshot of sales listings

apartments_rent_pl_YYYY_MM.csv - monthly snapshot of rental offers

Data fields

city - name of the city where the property is located
type - type of building
squareMeters - size of the apartment in square meters
rooms - number of rooms in the apartment
floor / floorCount - floor on which the apartment is located and the total number of floors in the building
buildYear - the year in which the building was built
latitude, longitude - geographical coordinates of the property
centerDistance - distance from the city center in km
poiCount - number of places of interest within a radius of 500m from the apartment (schools, clinics, post office, kindergartens, restaurants, universities, pharmacies)
Distance - distance to the nearest point of interest (schools, clinics, post office, etc.)

Important data manipulations conducted in this step are adding months of when the data come from to the dataset, and binding files. In the raw state of data, this date is only included in the name of the file, and also we have separate files for each month.

# Loading data
load_data <- function(name_struct) {
  files_names = 
    # Listing files in the proper location and with given pattern
    list.files(
    path = file.path("./analiza_danych_projekt_zespolowy/Nieruchomosci w Polsce"),
    pattern = name_struct, 
    full.names = TRUE)
  myfiles = lapply(files_names, read.csv)
  
  # adding column "month", labeling data from which month it was scrapped
  for (i in seq_along(myfiles)) {myfiles[[i]]$month = to_date(files_names[i])}
  
  # bind together dataframe
  return(do.call(rbind, myfiles))
}

# adding new column to each DF with month
to_date <- function(primary_date) {
  date_part <- sub(".*(\\d{4}_\\d{2}).*", "\\1", primary_date)
  complete_date <- paste0(date_part, "_01")
  
  parsed_date <- 
    as.Date(complete_date, format = "%Y_%m_%d")
  
  return(parsed_date)
}

data_buy = load_data("^apartments_pl_[0-9]+_[0-9]+\\.csv$")
data_rent = load_data("^apartments_rent_pl_[0-9]+_[0-9]+\\.csv$")

Data Cleansing and Wrangling

Data cleansing and wrangling are essential steps in the data analysis process, ensuring that raw data is transformed into a structured, accurate, and usable format. Data cleansing involves among others handling missing values, outliers and identifying and correcting errors. Data wrangling focuses on reshaping, merging, and transforming datasets to make them suitable for analysis.

The goal of these processes is to enhance data integrity, eliminate inconsistencies, and prepare the dataset for meaningful insights.

Removing unnecessary data

Row IDs are unique and therefore provide no differentiation or analytical value. Additionally, due to the nature of the data, duplicate values are possible. For example, if a property was listed for sale in January but sold in March, it would appear three times—once for each month it was listed.

n_occur_buy <- data.frame(table(data_buy$id))
# n_occur_buy[n_occur_buy$Freq > 1,]

n_occur_rent <- data.frame(table(data_rent$id))
# n_occur_rent[n_occur_rent$Freq > 1,]

data_buy <- data_buy[!duplicated(data_buy$id),]
data_rent <- data_rent[!duplicated(data_rent$id),]

data_buy <- data_buy %>% 
  select(-id)

data_rent <- data_rent %>% 
  select(-id)

Data summary

datasummary_skim(data_buy)

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
squareMeters	7173	0	56.5	20.3	25.0	52.6	150.0
rooms	6	0	2.6	0.9	1.0	2.0	6.0
floor	28	17	3.4	2.6	1.0	3.0	29.0
floorCount	30	1	5.5	3.4	1.0	4.0	29.0
buildYear	166	17	1987.1	31.9	1850.0	1994.0	2024.0
latitude	46449	0	52.0	1.3	50.0	52.2	54.6
longitude	48233	0	19.6	1.7	14.4	19.9	23.2
centreDistance	1487	0	4.6	2.8	0.0	4.3	16.9
poiCount	196	0	19.9	22.8	0.0	14.0	212.0
schoolDistance	2506	0	0.4	0.5	0.0	0.3	4.9
clinicDistance	4264	0	1.0	0.9	0.0	0.7	5.0
postOfficeDistance	2690	0	0.5	0.5	0.0	0.4	5.0
kindergartenDistance	2310	0	0.4	0.4	0.0	0.3	5.0
restaurantDistance	2407	0	0.3	0.4	0.0	0.2	5.0
collegeDistance	4818	3	1.5	1.1	0.0	1.2	5.0
pharmacyDistance	2421	0	0.4	0.4	0.0	0.2	5.0
price	7287	0	770307.8	397438.1	150000.0	689000.0	3250000.0
		N	%
city	bialystok	1017	1.1
	bydgoszcz	3627	3.9
	czestochowa	1242	1.3
	gdansk	8244	8.9
	gdynia	3569	3.8
	katowice	2397	2.6
	krakow	14186	15.3
	lodz	7073	7.6
	lublin	2299	2.5
	poznan	3639	3.9
	radom	845	0.9
	rzeszow	762	0.8
	szczecin	2487	2.7
	warszawa	31982	34.4
	wroclaw	9598	10.3
type		19792	21.3
	apartmentBuilding	14947	16.1
	blockOfFlats	46347	49.9
	tenement	11881	12.8
ownership	condominium	83368	89.7
	cooperative	9592	10.3
	udział	7	0.0
buildingMaterial		39179	42.1
	brick	40522	43.6
	concreteSlab	13266	14.3
condition		69914	75.2
	low	10373	11.2
	premium	12680	13.6
hasParkingSpace	no	67949	73.1
	yes	25018	26.9
hasBalcony	no	38958	41.9
	yes	54009	58.1
hasElevator		4448	4.8
	no	43155	46.4
	yes	45364	48.8
hasSecurity	no	82428	88.7
	yes	10539	11.3
hasStorageRoom	no	51899	55.8
	yes	41068	44.2

datasummary_skim(data_rent)

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
squareMeters	3260	0	53.4	21.2	25.0	48.5	150.0
rooms	6	0	2.3	0.8	1.0	2.0	6.0
floor	28	12	3.4	2.6	1.0	3.0	30.0
floorCount	31	2	5.7	3.4	1.0	5.0	30.0
buildYear	149	28	1998.8	29.8	1850.0	2010.0	2024.0
latitude	22850	0	51.7	1.2	50.0	52.2	54.6
longitude	23409	0	19.7	1.7	14.4	20.0	23.2
centreDistance	1302	0	3.9	2.6	0.0	3.5	16.6
poiCount	188	0	24.2	26.2	0.0	16.0	210.0
schoolDistance	1668	0	0.4	0.3	0.0	0.3	4.9
clinicDistance	3163	0	0.8	0.7	0.0	0.6	5.0
postOfficeDistance	1879	0	0.5	0.4	0.0	0.4	4.9
kindergartenDistance	1465	0	0.3	0.3	0.0	0.3	4.8
restaurantDistance	1456	0	0.3	0.3	0.0	0.2	5.0
collegeDistance	4154	1	1.3	1.0	0.0	1.0	5.0
pharmacyDistance	1490	0	0.3	0.3	0.0	0.2	5.0
price	996	0	3682.3	2186.0	346.0	3000.0	23000.0
		N	%
city	bialystok	223	0.6
	bydgoszcz	731	1.9
	czestochowa	345	0.9
	gdansk	1930	5.1
	gdynia	769	2.0
	katowice	1453	3.8
	krakow	7366	19.4
	lodz	2647	7.0
	lublin	632	1.7
	poznan	2217	5.8
	radom	115	0.3
	rzeszow	310	0.8
	szczecin	719	1.9
	warszawa	14785	39.0
	wroclaw	3699	9.7
type		9190	24.2
	apartmentBuilding	12495	32.9
	blockOfFlats	12590	33.2
	tenement	3666	9.7
	condominium	37941	100.0
buildingMaterial		16157	42.6
	brick	19939	52.6
	concreteSlab	1845	4.9
condition		27790	73.2
	low	123	0.3
	premium	10028	26.4
hasParkingSpace	no	23159	61.0
	yes	14782	39.0
hasBalcony	no	14505	38.2
	yes	23436	61.8
hasElevator		2096	5.5
	no	12310	32.4
	yes	23535	62.0
hasSecurity	no	32234	85.0
	yes	5707	15.0
hasStorageRoom	no	31784	83.8
	yes	6157	16.2

A brief exploration of the dataset reveals that both the rent and buy datasets share the same 28 features, which include both categorical and continuous variables. The buy dataset contains 92 967 unique records, while the rent dataset has 37 941 unique records. An interesting observation is that the data includes not only missing values (NAs) but also different type of blank entries, which will be addressed in the next section.

Checking NAs

# Checking the nature of "other" blanks and replacing them with NA
data_buy[5,"buildingMaterial"]

## [1] ""

data_buy[data_buy==""] <- NA
data_rent[data_rent==""] <- NA

knitr::kable(data.frame(miss_var_summary(data_buy)))

variable	n_miss	pct_miss
condition	69914	75.2
buildingMaterial	39179	42.1
type	19792	21.3
floor	15984	17.2
buildYear	15641	16.8
hasElevator	4448	4.78
collegeDistance	2492	2.68
floorCount	1082	1.16
clinicDistance	333	0.358
restaurantDistance	226	0.243
pharmacyDistance	128	0.138
postOfficeDistance	99	0.106
kindergartenDistance	85	0.0914
schoolDistance	60	0.0645
city	0	0
squareMeters	0	0
rooms	0	0
latitude	0	0
longitude	0	0
centreDistance	0	0
poiCount	0	0
ownership	0	0
hasParkingSpace	0	0
hasBalcony	0	0
hasSecurity	0	0
hasStorageRoom	0	0
price	0	0
month	0	0

knitr::kable(data.frame(miss_var_summary(data_rent)))

variable	n_miss	pct_miss
condition	27790	73.2
buildingMaterial	16157	42.6
buildYear	10570	27.9
type	9190	24.2
floor	4646	12.2
hasElevator	2096	5.52
floorCount	730	1.92
collegeDistance	490	1.29
restaurantDistance	82	0.216
pharmacyDistance	38	0.100
clinicDistance	32	0.0843
kindergartenDistance	28	0.0738
postOfficeDistance	19	0.0501
schoolDistance	11	0.0290
city	0	0
squareMeters	0	0
rooms	0	0
latitude	0	0
longitude	0	0
centreDistance	0	0
poiCount	0	0
ownership	0	0
hasParkingSpace	0	0
hasBalcony	0	0
hasSecurity	0	0
hasStorageRoom	0	0
price	0	0
month	0	0

Both data sets have some serious problems with missing data, especially in condition and building material.

Data set ‘buy’

# Upset plot of missing data and number of missings by a city - buy dataset
NA_vector_buy <- c("floor", "buildYear", "collegeDistance", "floorCount", 
               "clinicDistance", "restaurantDistance", "pharmacyDistance", 
               "postOfficeDistance", "kindergartenDistance", "schoolDistance",
               "condition", "buildingMaterial", "type", "hasElevator")

gg_miss_upset(data_buy, nsets=14, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
              set_size.numbers_size=TRUE, set_size.show = TRUE)

gg_miss_fct(data_buy[,c(NA_vector_buy, "city")], fct = city) +
  theme(text = element_text(size = 25))

An analysis of missing values in the ‘buy’ dataset using an upset plot reveals some connections between condition, building material, and type. However, given the scale of missing values in these features, their mutual occurrence does not appear to be significant. On the other hand, in Gdańsk and Gdynia, the ‘condition’ attribute is missing more frequently than in other cities, while in Częstochowa, the ‘build year’ is notably absent more often.

Data set ‘rent’

# Upset plot of missing data and number of missings by a city - rent dataset
NA_vector_rent <- c("condition", "floor", "buildYear", "floorCount", "collegeDistance",
               "clinicDistance", "restaurantDistance", "pharmacyDistance", 
               "postOfficeDistance", "kindergartenDistance", "schoolDistance")

gg_miss_upset(data_rent, nsets=10, mb.ratio = c(0.5, 0.5), text.scale = c(2.5,2.5,2.5,2.5,1.9,1.5),
              set_size.numbers_size=TRUE, set_size.show = TRUE)

gg_miss_fct(data_rent[,c(NA_vector_rent, "city")], fct = city) +
  theme(text = element_text(size = 25))

A similar pattern is observed in the ‘rent’ dataset. Once again, some connections between condition, building material, and type are visible, but they do not appear to be significant. In this case, missing condition data is noticeable not only in Gdańsk and Gdynia but also in Częstochowa, where the ‘build year’ attribute is also frequently missing.

In the next steps will be checked correlation between NAs and other variable, for this purpose we will conduct below steps:

code NAs to 1 (in case of NA) and 0
code binary, categorical variables
exclude non binar variables

# Coding NAs
NA_data_buy = data_buy
NA_data_rent = data_rent

for (i in 1:length(NA_vector_buy)) {
  col_name <- paste("NA",NA_vector_buy[i], sep="_")
  NA_data_buy[[col_name]] <- ifelse(is.na(NA_data_buy[[NA_vector_buy[i]]]), 1, 0)
}

for (i in 1:length(NA_vector_rent)) {
  col_name <- paste("NA",NA_vector_rent[i], sep="_")
  NA_data_rent[[col_name]] <- ifelse(is.na(NA_data_rent[[NA_vector_rent[i]]]), 1, 0)
}

# Coding categorical variables
NA_data_buy$condition <- 
  NA_data_buy$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1

NA_data_rent$condition <- 
  NA_data_rent$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1

yes_no_variables <- c("hasParkingSpace", "hasBalcony", "hasElevator", 
                      "hasSecurity", "hasStorageRoom")

for (i in 1:length(yes_no_variables)){
  NA_data_buy[[yes_no_variables[i]]] <- 
    NA_data_buy[[yes_no_variables[i]]] %>% 
    ordered(c("no", "yes")) %>% 
    as.numeric()-1
  
  NA_data_rent[[yes_no_variables[i]]] <- 
    NA_data_rent[[yes_no_variables[i]]] %>% 
    ordered(c("no", "yes")) %>% 
    as.numeric()-1
}

# vector of features to exclude
exclude_HM <- c("city", "type", "ownership", "buildingMaterial", "month", 
                "condition", "hasElevator")

# Heat maps

selected_features1 <- names(NA_data_buy)[startsWith(names(NA_data_buy), "NA_")]
selected_features2 <- names(NA_data_buy)[!startsWith(names(NA_data_buy), "NA_")]

NA_data_buy[,!names(NA_data_buy) %in% exclude_HM] %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits=2) %>%
  melt() %>%
  filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
  ggplot(aes(x=Var1, y=Var2, fill=value))  +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Correlation matrix with NA - 'buy' dataset") +
  scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0)+
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))  +
  theme(text = element_text(size = 25))

selected_features1 <- names(NA_data_rent)[startsWith(names(NA_data_rent), "NA_")]
selected_features2 <- names(NA_data_rent)[!startsWith(names(NA_data_rent), "NA_")]
  
NA_data_rent[, !names(NA_data_rent) %in% exclude_HM] %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits = 2) %>%
  melt() %>%
  filter(Var1 %in% selected_features2 & Var2 %in% selected_features1) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Correlation matrix with NA - 'rent' dataset") +
  scale_fill_gradient2(low = "red", mid = "#FFDD94", high = "green", midpoint = 0) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  theme(text = element_text(size = 25))

Correlation matrix doesn’t show any strong dependencies between missing and other variables

Outliers

For finding outliers we used couple methods:

Grubbs’ test
boxplots
Interqunatile Range Method - Q-Q plot

We paid special attention to price since this is the subject of our research. Grubbs’ test allows us defining features with potential outliers:

categorical_var <- c("city", "type", "ownership", "buildingMaterial", "condition", "hasParkingSpace", "hasBalcony", "hasElevator", "hasSecurity", "hasStorageRoom", "id", "month")

grubbs_test_buy <- 
  sapply(data_buy[,!names(data_buy) %in% categorical_var], grubbs.test)
var_buy <- cbind(grubbs_test_buy["p.value",][grubbs_test_buy["p.value",] < 0.05])
knitr::kable(var_buy, caption = "Variables with p-value lower than 5% (buy data set)", col.names = c("Variable", "p-value"))

Variables with p-value lower than 5% (buy data set)
Variable	p-value
floor	0
floorCount	1.00849896211841e-07
poiCount	0
schoolDistance	0
postOfficeDistance	0
kindergartenDistance	0
restaurantDistance	0
pharmacyDistance	0
price	2.03623467800451e-05

grubbs_test_rent <- 
 sapply(data_rent[,!names(data_rent) %in% categorical_var], grubbs.test)
var_rent <- cbind(grubbs_test_rent["p.value",][grubbs_test_rent["p.value",] < 0.05])
knitr::kable(var_rent, caption = "Variables with p-value lower than 5% (rent data set)", col.names = c("Variable", "p-value"))

Variables with p-value lower than 5% (rent data set)
Variable	p-value
floor	0
floorCount	2.41967365033346e-08
buildYear	0.0077240047029381
centreDistance	0.0210443044676376
poiCount	2.45829663292341e-08
schoolDistance	0
clinicDistance	0.000174023317614402
postOfficeDistance	0
kindergartenDistance	0
restaurantDistance	0
pharmacyDistance	0
price	0

The strongest outliers are suspected in floor, price, and proximity-related variables (schoolDistance, postOfficeDistance, etc.), suggesting unusually high or low values in these attributes. Outliers in rental prices could indicate luxury apartments or extremely cheap listings. Floor-related outliers might suggest penthouse apartments or ground-floor units with unusual characteristics. The rental data set has more variables with outlier candidates.

Outliers visualization for the ‘buy’ data set

# boxplots buy
bp1 <- ggplot(pivot_longer(data_buy[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Public destination distance outliers") + 
  labs(x="", y="Miles") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  theme(text = element_text(size = 25)) 

bp2 <- ggplot(pivot_longer(data_buy[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Floor and floor count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp3 <- ggplot(pivot_longer(data_buy[c("poiCount")], cols="poiCount"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("PoI count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp4 <- ggplot(pivot_longer(data_buy[c("price")], cols="price"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Price outliers") + 
  labs(x="", y="PLN") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)

Outliers visualization for the ‘rent’ data set

# boxplots rent
bp1 <- ggplot(pivot_longer(data_rent[c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")], cols=c("schoolDistance", "postOfficeDistance","kindergartenDistance", "restaurantDistance", "pharmacyDistance")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Public destination distance outliers") + 
  labs(x="", y="Miles") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  theme(text = element_text(size = 25)) 

bp2 <- ggplot(pivot_longer(data_rent[c("floor", "floorCount")], cols=c("floor", "floorCount")), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Floor and floor count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp3 <- ggplot(pivot_longer(data_rent[c("poiCount")], cols="poiCount"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("PoI count outliers") + 
  labs(x="", y="Count") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

bp4 <- ggplot(pivot_longer(data_rent[c("price")], cols="price"), aes(name, value)) + 
  geom_boxplot() +
  ggtitle("Price outliers") + 
  labs(x="", y="PLN") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

ggarrange(bp1, bp2, bp3, bp4, ncol = 2, nrow = 2)

Some apartments are located in remote areas with poor access to these public facilities. Outliers in floor and floorCount variables indicate penthouses or very high-floor apartments in skyscrapers. High value of PoI suggests that some apartments are located in highly urbanized areas with dense amenities, while high prices indicate luxury apartments or highly premium properties. The boxplots exhibit a dense concentration of outliers, with no observations noticeably separated from the rest. Above mentioned suggests that the detected outliers are not due to erroneous data but rather a result of a fat-tailed distribution in the given features.

# Interqunatile Range Method for price outliers
buy_Q1 <- quantile(data_buy$price, 0.25, na.rm = TRUE)
buy_Q3 <- quantile(data_buy$price, 0.75, na.rm = TRUE)
buy_IQR_value <- buy_Q3 - buy_Q1
buy_lower_bound <- max(0, buy_Q1 - 1.5 * buy_IQR_value)  # Adjust lower bound to 0, because price cannot be negative
buy_upper_bound <- buy_Q3 + 1.5 * buy_IQR_value

rent_Q1 <- quantile(data_rent$price, 0.25, na.rm = TRUE)
rent_Q3 <- quantile(data_rent$price, 0.75, na.rm = TRUE)
rent_IQR_value <- rent_Q3 - rent_Q1
rent_lower_bound <- max(0, rent_Q1 - 1.5 * rent_IQR_value)  # Adjust lower bound to 0, because price cannot be negative
rent_upper_bound <- rent_Q3 + 1.5 * rent_IQR_value

# Identify price outliers 
buy_outliers <- 
  data_buy[data_buy$price < buy_lower_bound | data_buy$price > buy_upper_bound,]
rent_outliers <- 
  data_rent[data_rent$price < rent_lower_bound | data_rent$price > rent_upper_bound,]

# See how many values are outliers related to the original dataset
print(paste("Price outliers related to the original dataset - buy ", as.character(round((length(buy_outliers$price)/length(data_buy$price)) * 100, 2)), "%", sep=""))

## [1] "Price outliers related to the original dataset - buy 6.09%"

print(paste("Price outliers related to the original dataset - rent ", as.character(round(length(rent_outliers$price)/length(data_rent$price) * 100, 2)), "%", sep=""))

## [1] "Price outliers related to the original dataset - rent 7.57%"

Taking a closer look at the rent outliers

# Taking a closer look at the rent outliers
knitr::kable(data_rent[data_rent$price == min(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])

	city	price	squareMeters	poiCount
30685	katowice	346	30.36	4

knitr::kable(data_rent[data_rent$price == max(data_rent$price), ][c("city", "price", "squareMeters", "poiCount")])

	city	price	squareMeters	poiCount
32075	warszawa	23000	148.0	62
33121	warszawa	23000	147.6	60
33521	warszawa	23000	126.0	100

While 346 PLN for rent may seem low, there are multiple observations close to this price. There are 3 observations with price of 23 000 PLN, all located in the center of Warsaw, where there is a high density of points of interest (POI).

# Taking a closer look at the buy outlier
knitr::kable(data_buy[data_buy$price == min(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])

	city	price	squareMeters	poiCount
18202	bydgoszcz	150000	31.02	8
18206	bydgoszcz	150000	35.94	8

knitr::kable(data_buy[data_buy$price == max(data_buy$price), ][c("city", "price", "squareMeters", "poiCount")])

	city	price	squareMeters	poiCount
79943	warszawa	3250000	133.51	9
80351	warszawa	3250000	136.00	2
84104	warszawa	3250000	133.51	13
129837	warszawa	3250000	131.70	26

For apartments listed for sale, neither a price of 1,500,000 PLN nor 3,250,000 PLN appears alarming or unusual. There are multiple listings with prices close to these values, suggesting they are within a reasonable range.

An additional confirmation of the price distribution would be a Q-Q plot. Since apartment prices are typically log-normally distributed, a Q-Q plot of log-transformed rent prices would help assess their conformity to this distribution.

qqnorm(log(data_buy$price), main = "Normal Q-Q Plot ('buy' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_buy$price))

qqnorm(log(data_rent$price), main = "Normal Q-Q Plot ('rent' data set)", cex.lab=3, cex.main=3, cex.axis=3)
qqline(log(data_rent$price))

In our case distributions are fat-tailed.

Data validation

Data validation ensures data correctness by verifying logical consistency and identifying anomalies. In this data set, we checked the following properties:

Floor number is less than or equal to the total floor count.
Build year is less than 2024.
Latitude and longitude fall within Poland’s geographic range.

# Data validation
RULE <- editfile("RULES")
violated_buy <- violatedEdits(RULE, data_buy)
# summary(violated_buy)

RULE <- editfile("RULES")
violated_rent <- violatedEdits(RULE, data_rent)
# summary(violated_rent)

Data imputation

Since there are no incorrect data or outliers, we can proceed with imputing missing values using the MICE method, specifically the classification and regression trees (CART) variation. First, we encode text and categorical variables to make them suitable for imputation. Next, we perform the imputation, verify its success, and finally decode the variables back to their original format.

To avoid retraining the model each time, we will save the imputation results and disable the relevant cells from running in every Markdown execution.

# coding data - buy dataset
data_buy$condition <- 
  data_buy$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1 

data_buy$buildingMaterial <- 
  data_buy$buildingMaterial %>% 
  ordered(c("brick", "concreteSlab")) %>% 
  as.numeric()-1 

data_buy$type <- 
  data_buy$type %>% 
  ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>% 
  as.numeric()-1 

data_buy$hasElevator <- 
  data_buy$hasElevator %>% 
  ordered(c("no", "yes")) %>% 
  as.numeric()-1

# MICE imputation
imp_buy <- mice(data_buy, m=3, maxit=3, method="cart")

# Pool results into final dataset
complete_data_buy <- complete(imp_buy)

# evaluating imputation results
data.frame(miss_var_summary(complete_data_buy))

Data was successfully imputed

# decoding variables
complete_data_buy$condition <- ifelse(complete_data_buy$condition == 0, "low", "premium")
complete_data_buy$buildingMaterial <- ifelse(complete_data_buy$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_buy$type <- ifelse(complete_data_buy$type == 0, "apartmentBuilding",
                        ifelse(complete_data_buy$type == 1, "blockOfFlats", "tenement"))
complete_data_buy$hasElevator <- ifelse(complete_data_buy$hasElevator == 0, "no", "yes")

# saving imputed data
write.csv(complete_data_buy,"clean_buy.csv")

# coding vriables - rent dataset
data_rent$condition <- 
  data_rent$condition %>% 
  ordered(c("low", "premium")) %>% 
  as.numeric()-1 

data_rent$buildingMaterial <- 
  data_rent$buildingMaterial %>% 
  ordered(c("brick", "concreteSlab")) %>% 
  as.numeric()-1 

data_rent$type <- 
  data_rent$type %>% 
  ordered(c("apartmentBuilding", "blockOfFlats", "tenement")) %>% 
  as.numeric()-1 

data_rent$hasElevator <- 
  data_rent$hasElevator %>% 
  ordered(c("no", "yes")) %>% 
  as.numeric()-1

# Multiple data imputation
imp_rent <- mice(data_rent, m=3, maxit=3, method="cart")

# Pool results into final dataset
complete_data_rent <- complete(imp_rent)

# evaluating imputation
data.frame(miss_var_summary(complete_data_rent))

# decoding variables
complete_data_rent$condition <- ifelse(complete_data_rent$condition == 0, "low", "premium")
complete_data_rent$buildingMaterial <- ifelse(complete_data_rent$buildingMaterial == 0, "brick", "concreteSlab")
complete_data_rent$type <- ifelse(complete_data_rent$type == 0, "apartmentBuilding",
                        ifelse(complete_data_rent$type == 1, "blockOfFlats", "tenement"))
complete_data_rent$hasElevator <- ifelse(complete_data_rent$hasElevator == 0, "no", "yes")

# saving imputed data
write.csv(complete_data_rent,"clean_rent.csv")

Visualizations

complete_data_buy <- read.csv("clean_buy.csv")
complete_data_buy <- complete_data_buy %>%
  select(-X)
complete_data_rent <- read.csv("clean_rent.csv")
complete_data_rent <- complete_data_rent %>%
  select(-X)
# or
# complete_data_buy <- complete_data_buy
# complete_data_rent <- complete_data_rent

total_median_buy <- median(complete_data_buy$price)

plot_median_buy <- complete_data_buy %>%
  group_by(city) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = reorder(city, median_price), 
             y = median_price,
             fill = median_price > total_median_buy)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median Purchase Price by City", 
       x = "City", 
       y = "Median Price") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


total_median_rent <- median(complete_data_rent$price)

plot_median_rent <- complete_data_rent %>%
  group_by(city) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = reorder(city, median_price), 
             y = median_price,
             fill = median_price > total_median_rent)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("FALSE" = "#90EE90", "TRUE" = "#ffadad"), guide = "none") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median Rent Price by City", 
       x = "City", 
       y = "Median Price") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(plot_median_buy, plot_median_rent, ncol = 2)

There are highest prices for housing purchasing in the largest Polish cities as it follows from the graph, where in red marked cities with median prices above overall median price. The highest prices are in Warsaw, Krakow and Gdansk, while the cheapest housings are in Czestochowa, Radom and Bydgoszcz. The rent prices are a little more close in terms of the median values, nevertheless there is an outlier - capital city Warsaw, where the median is over 30% higher than in other cities.

The impact of amenities on price

The images present scatter plots for different Polish cities, showing the relationship between apartment size (square meters) and price (PLN). The red line represents a fitted trend line for each city.

Purchase data set

ggplot(complete_data_buy, aes(x = squareMeters, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~city) +
  labs(title = "Appartemnt Price vs Size by City") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25))

Warsaw, Gdańsk, Kraków, and Gdynia have steeper slopes, meaning that prices per square meter are higher. Radom, Bydgoszcz, and Częstochowa have flatter slopes, suggesting that price per square meter is lower in these cities. Some cities (e.g., Warsaw, Kraków, and Gdańsk) exhibit higher dispersion, meaning a wider range of apartment prices for similar sizes. Outliers are visible, particularly in expensive markets like Warsaw and Gdynia, where some apartments are priced significantly higher than the trend line suggests. Regional cities (e.g., Lublin, Radom, and Rzeszów) have much lower price ceilings, indicating lower overall demand and affordability.

Buy data set

ggplot(complete_data_rent, aes(x = squareMeters, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", color = "red") +
  facet_wrap(~city) +
  labs(title = "Appartemnt Price vs Size by City") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25))

Apartment size is a key determinant of price, but the rate at which price increases varies across cities. There is a clearly visible positive correlation between the size of an apartment and its price. The slope of the read line (smoothing) is the greatest for the cities with the most expensive apartments, so the result is correlated with conclusions made earlier.

Buy data set

complete_data_buy %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
  ggplot(aes(x = has_amenity, y = price)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~amenity) +
  labs(title = "Price Distribution by Amenities") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25))

Rental data set

complete_data_rent %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity, )) %>%
  ggplot(aes(x = has_amenity, y = price)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~amenity) +
  labs(title = "Price Distribution by Amenities") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25))

The presence of amenities increases the price of housing, while there is an outlier amenity - storage room: the median price for housings with a storage room is lower than without it. The difference in price for the categories is plotted on the next graph.

pl1 <- complete_data_buy %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
  group_by(amenity) %>%
  mutate(
    avg_no = mean(price[has_amenity == "no"]),
    avg_yes = mean(price[has_amenity == "yes"]),
    pct_diff = (avg_yes - avg_no) / avg_no * 100
  ) %>%
  ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
  geom_col() +
  scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
  coord_flip() +
  labs(
    title = "Purchase data",
    x = "Amenity",
    y = "Price Difference (%)",
    fill = "Price Impact"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- complete_data_rent %>%
  gather(amenity, has_amenity, c(hasBalcony, hasElevator, hasParkingSpace, hasStorageRoom, hasSecurity)) %>%
  group_by(amenity) %>%
  mutate(
    avg_no = mean(price[has_amenity == "no"]),
    avg_yes = mean(price[has_amenity == "yes"]),
    pct_diff = (avg_yes - avg_no) / avg_no * 100
  ) %>%
  ggplot(aes(x = reorder(amenity, pct_diff), y = pct_diff, fill = pct_diff > 0)) +
  geom_col() +
  scale_fill_manual(values = c("FALSE" = "#ffadad", "TRUE" = "#90EE90"), guide = "none") +
  coord_flip() +
  labs(
    title = "Rental data",
    x = "Amenity",
    y = "Price Difference (%)",
    fill = "Price Impact"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) paste0(round(x, 1), "%")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price Premium for Amenities",gp=gpar(fontsize=35)))

Amenities (all but storageRoom) seem to have a noticeable positive impact on price of the apartment. A lower purchase price for housings with storage rooms may be caused by hidden additional costs, which means that buyers are forced to pay extra for a storage room. The phenomena requires further analysis with broader range of data.

Correlation exploration

Purchase data set

distance_vars_buy <- select(complete_data_buy, contains("Distance"), price)
correlation_matrix_buy <- cor(distance_vars_buy, use = "complete.obs")
corrplot(correlation_matrix_buy, method = "color", addCoef.col = 1,
         tl.cex = 1.8, number.cex = 1.8)

There is no strong correlation found between the level of prices and distance to public places. However there is pretty solid positive correlation between variables of distances to public places.

Rental data set

distance_vars_rent <- select(complete_data_rent, contains("Distance"), price)
correlation_matrix_rent <- cor(distance_vars_rent, use = "complete.obs")
corrplot(correlation_matrix_rent, method = "color", addCoef.col = 1,
         tl.cex = 1.8, number.cex = 1.8)

Similar trends are observed on a rent market data set.

plot_age_buy <- ggplot(complete_data_buy, aes(x = buildYear, y = price)) +
  geom_smooth() +
  labs(title = "Purchase data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

plot_age_rent <- ggplot(complete_data_rent, aes(x = buildYear, y = price)) +
  geom_smooth() +
  labs(title = "Rental data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(plot_age_buy+theme(legend.position="top"),
    plot_age_rent+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price vs Building Age",gp=gpar(fontsize=35)))

Interesting things can be observed – model is less certain about the smooth price for apartments for rent built in the 70s, which might suggest there’s less of them available for rent. This phenomena is correlated with modern market rules, as the demand on the new apartments is higher. The apartments in very old buildings tend to be very expensive, but there are not many of them. Commie blocks are the least expensive for both rent and purchase. Out of modern buildings those built around 2508 seem to be the most expensive. The prices for the newest buildings seem to be falling with each year.

pl1 <- ggplot(complete_data_buy, aes(x = condition, y = price)) +
  geom_violin(fill = "beige") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- ggplot(complete_data_rent, aes(x = condition, y = price)) +
  geom_violin(fill = "beige") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Rental data") +
  scale_y_continuous(labels = label_comma()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price Distribution by Property Condition",gp=gpar(fontsize=35)))

There is no significant influence of housing condition on the price level, nevertheless its visible that premium housings are a little bit more expensive. It can be noticed that the figure of premium condition has a more curvy shape, which can be explained by larger number of observation.

Descriptive analysis

Summary statistics

data_buy <- complete_data_buy
data_rent <- complete_data_rent

data_buy <- data_buy %>% 
  mutate(price_per_m2 = price / squareMeters)
data_rent <- data_rent %>% 
  mutate(price_per_m2 = price / squareMeters)

print(dfSummary(data_buy, na.col=FALSE, headings=FALSE, valid.col=FALSE),
      method = "render")

Variable

Stats / Values

Freqs (% of Valid)

Graph

city [character]

1. warszawa

2. krakow

3. wroclaw

4. gdansk

5. lodz

6. poznan

7. bydgoszcz

8. gdynia

9. szczecin

10. katowice

[ 5 others ]

31982	(	34.4%	)
14186	(	15.3%	)
9598	(	10.3%	)
8244	(	8.9%	)
7073	(	7.6%	)
3639	(	3.9%	)
3627	(	3.9%	)
3569	(	3.8%	)
2487	(	2.7%	)
2397	(	2.6%	)
6165	(	6.6%	)

type [character]

1. apartmentBuilding

2. blockOfFlats

3. tenement

19050	(	20.5%	)
59487	(	64.0%	)
14430	(	15.5%	)

squareMeters [numeric]

Mean (sd) : 56.5 (20.3)

min ≤ med ≤ max:

25 ≤ 52.6 ≤ 150

IQR (CV) : 22.8 (0.4)

7173 distinct values

rooms [integer]

Mean (sd) : 2.6 (0.9)

min ≤ med ≤ max:

1 ≤ 2 ≤ 6

IQR (CV) : 1 (0.3)

1	:	5960	(	6.4%	)
2	:	40684	(	43.8%	)
3	:	33745	(	36.3%	)
4	:	10150	(	10.9%	)
5	:	1853	(	2.0%	)
6	:	575	(	0.6%	)

floor [integer]

Mean (sd) : 3.3 (2.5)

min ≤ med ≤ max:

1 ≤ 3 ≤ 29

IQR (CV) : 2 (0.8)

27 distinct values

floorCount [integer]

Mean (sd) : 5.5 (3.3)

min ≤ med ≤ max:

1 ≤ 4 ≤ 29

IQR (CV) : 4 (0.6)

29 distinct values

buildYear [integer]

Mean (sd) : 1985.1 (32.5)

min ≤ med ≤ max:

1850 ≤ 1988 ≤ 2024

IQR (CV) : 46 (0)

165 distinct values

latitude [numeric]

Mean (sd) : 52 (1.3)

min ≤ med ≤ max:

50 ≤ 52.2 ≤ 54.6

IQR (CV) : 1.2 (0)

46449 distinct values

longitude [numeric]

Mean (sd) : 19.6 (1.7)

min ≤ med ≤ max:

14.4 ≤ 19.9 ≤ 23.2

IQR (CV) : 2.4 (0.1)

48233 distinct values

centreDistance [numeric]

Mean (sd) : 4.6 (2.8)

min ≤ med ≤ max:

0 ≤ 4.3 ≤ 16.9

IQR (CV) : 4.1 (0.6)

1487 distinct values

poiCount [integer]

Mean (sd) : 19.9 (22.8)

min ≤ med ≤ max:

0 ≤ 14 ≤ 212

IQR (CV) : 17 (1.1)

196 distinct values

schoolDistance [numeric]

Mean (sd) : 0.4 (0.5)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 4.9

IQR (CV) : 0.3 (1.1)

2505 distinct values

clinicDistance [numeric]

Mean (sd) : 1 (0.9)

min ≤ med ≤ max:

0 ≤ 0.7 ≤ 5

IQR (CV) : 0.9 (0.9)

4263 distinct values

postOfficeDistance [numeric]

Mean (sd) : 0.5 (0.5)

min ≤ med ≤ max:

0 ≤ 0.4 ≤ 5

IQR (CV) : 0.4 (0.9)

2689 distinct values

kindergartenDistance [numeric]

Mean (sd) : 0.4 (0.4)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 5

IQR (CV) : 0.3 (1.2)

2309 distinct values

restaurantDistance [numeric]

Mean (sd) : 0.3 (0.4)

min ≤ med ≤ max:

0 ≤ 0.2 ≤ 5

IQR (CV) : 0.3 (1.3)

2406 distinct values

collegeDistance [numeric]

Mean (sd) : 1.6 (1.2)

min ≤ med ≤ max:

0 ≤ 1.2 ≤ 5

IQR (CV) : 1.6 (0.8)

4817 distinct values

pharmacyDistance [numeric]

Mean (sd) : 0.4 (0.4)

min ≤ med ≤ max:

0 ≤ 0.2 ≤ 5

IQR (CV) : 0.3 (1.3)

2420 distinct values

ownership [character]

1. condominium

2. cooperative

3. udział

83368	(	89.7%	)
9592	(	10.3%	)
7	(	0.0%	)

buildingMaterial [character]

1. brick

2. concreteSlab

67529	(	72.6%	)
25438	(	27.4%	)

condition [character]

1. low

2. premium

38204	(	41.1%	)
54763	(	58.9%	)

hasParkingSpace [character]

1. no

2. yes

67949	(	73.1%	)
25018	(	26.9%	)

hasBalcony [character]

1. no

2. yes

38958	(	41.9%	)
54009	(	58.1%	)

hasElevator [character]

1. no

2. yes

46645	(	50.2%	)
46322	(	49.8%	)

hasSecurity [character]

1. no

2. yes

82428	(	88.7%	)
10539	(	11.3%	)

hasStorageRoom [character]

1. no

2. yes

51899	(	55.8%	)
41068	(	44.2%	)

price [integer]

Mean (sd) : 770307.8 (397438.1)

min ≤ med ≤ max:

150000 ≤ 689000 ≤ 3250000

IQR (CV) : 379000 (0.5)

7287 distinct values

month [character]

1. 2023-08-01
2. 2023-09-01
3. 2023-10-01
4. 2023-11-01
5. 2023-12-01
6. 2024-01-01
7. 2024-02-01
8. 2024-03-01
9. 2024-04-01
10. 2024-05-01
11. 2024-06-01

18905	(	20.3%	)
6098	(	6.6%	)
7024	(	7.6%	)
7001	(	7.5%	)
7865	(	8.5%	)
5020	(	5.4%	)
7130	(	7.7%	)
7331	(	7.9%	)
8917	(	9.6%	)
8634	(	9.3%	)
9042	(	9.7%	)

price_per_m2 [numeric]

Mean (sd) : 13803.1 (4926.1)

min ≤ med ≤ max:

3000 ≤ 13323 ≤ 32096.8

IQR (CV) : 6856.9 (0.4)

47186 distinct values

Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10

Summary table contains information about every column in the purchase data set. It can be noticed from the table that

Every city occupies different part of the country market;
The most common building type is block of flats, while apartment buildings and tenements occupy ~25% of the market each;
Average housing area (m²) is 58.7, while this variable has right-skewed distribution;
The most widespread are 2 and 3 room apartments, whereas other categories make up under 25% of the total;
The histogram of floor values is falling, which is expected behavior;
There are apartments in buildings build from 1850 to 2024. Average trend is growing, but there are some drops in certain periods;
Distance variables and POI count seem to have right skewed distribution, as public facilities are erected in the most populated areas;
Almost 90% of apartments are owned on condominium agreement.

print(dfSummary(data_rent, na.col=FALSE, headings=FALSE, valid.col=FALSE),
      method = "render")

Variable

Stats / Values

Freqs (% of Valid)

Graph

city [character]

1. warszawa

2. krakow

3. wroclaw

4. lodz

5. poznan

6. gdansk

7. katowice

8. gdynia

9. bydgoszcz

10. szczecin

[ 5 others ]

14785	(	39.0%	)
7366	(	19.4%	)
3699	(	9.7%	)
2647	(	7.0%	)
2217	(	5.8%	)
1930	(	5.1%	)
1453	(	3.8%	)
769	(	2.0%	)
731	(	1.9%	)
719	(	1.9%	)
1625	(	4.3%	)

type [character]

1. apartmentBuilding

2. blockOfFlats

3. tenement

16188	(	42.7%	)
17006	(	44.8%	)
4747	(	12.5%	)

squareMeters [numeric]

Mean (sd) : 53.4 (21.2)

min ≤ med ≤ max:

25 ≤ 48.5 ≤ 150

IQR (CV) : 22 (0.4)

3260 distinct values

rooms [integer]

Mean (sd) : 2.3 (0.8)

min ≤ med ≤ max:

1 ≤ 2 ≤ 6

IQR (CV) : 1 (0.4)

1	:	3874	(	10.2%	)
2	:	21664	(	57.1%	)
3	:	9416	(	24.8%	)
4	:	2271	(	6.0%	)
5	:	504	(	1.3%	)
6	:	212	(	0.6%	)

floor [integer]

Mean (sd) : 3.3 (2.6)

min ≤ med ≤ max:

1 ≤ 3 ≤ 30

IQR (CV) : 2 (0.8)

27 distinct values

floorCount [integer]

Mean (sd) : 5.7 (3.4)

min ≤ med ≤ max:

1 ≤ 5 ≤ 30

IQR (CV) : 3 (0.6)

30 distinct values

buildYear [integer]

Mean (sd) : 1995.8 (31.7)

min ≤ med ≤ max:

1850 ≤ 2009 ≤ 2024

IQR (CV) : 41 (0)

148 distinct values

latitude [numeric]

Mean (sd) : 51.7 (1.2)

min ≤ med ≤ max:

50 ≤ 52.2 ≤ 54.6

IQR (CV) : 1.2 (0)

22850 distinct values

longitude [numeric]

Mean (sd) : 19.7 (1.7)

min ≤ med ≤ max:

14.4 ≤ 20 ≤ 23.2

IQR (CV) : 2.4 (0.1)

23409 distinct values

centreDistance [numeric]

Mean (sd) : 3.9 (2.6)

min ≤ med ≤ max:

0 ≤ 3.5 ≤ 16.6

IQR (CV) : 3.7 (0.7)

1302 distinct values

poiCount [integer]

Mean (sd) : 24.2 (26.2)

min ≤ med ≤ max:

0 ≤ 16 ≤ 210

IQR (CV) : 21 (1.1)

188 distinct values

schoolDistance [numeric]

Mean (sd) : 0.4 (0.3)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 4.9

IQR (CV) : 0.3 (0.9)

1667 distinct values

clinicDistance [numeric]

Mean (sd) : 0.8 (0.7)

min ≤ med ≤ max:

0 ≤ 0.6 ≤ 5

IQR (CV) : 0.7 (0.9)

3162 distinct values

postOfficeDistance [numeric]

Mean (sd) : 0.5 (0.4)

min ≤ med ≤ max:

0 ≤ 0.4 ≤ 4.9

IQR (CV) : 0.4 (0.8)

1878 distinct values

kindergartenDistance [numeric]

Mean (sd) : 0.3 (0.3)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 4.8

IQR (CV) : 0.2 (1)

1464 distinct values

restaurantDistance [numeric]

Mean (sd) : 0.3 (0.3)

min ≤ med ≤ max:

0 ≤ 0.2 ≤ 5

IQR (CV) : 0.2 (1.2)

1455 distinct values

collegeDistance [numeric]

Mean (sd) : 1.4 (1.1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 5

IQR (CV) : 1.4 (0.8)

4153 distinct values

pharmacyDistance [numeric]

Mean (sd) : 0.3 (0.3)

min ≤ med ≤ max:

0 ≤ 0.2 ≤ 5

IQR (CV) : 0.3 (1.1)

1489 distinct values

ownership [character]

1. condominium

37941

(

100.0%

)

buildingMaterial [character]

1. brick

2. concreteSlab

33271	(	87.7%	)
4670	(	12.3%	)

condition [character]

1. low

2. premium

865	(	2.3%	)
37076	(	97.7%	)

hasParkingSpace [character]

1. no

2. yes

23159	(	61.0%	)
14782	(	39.0%	)

hasBalcony [character]

1. no

2. yes

14505	(	38.2%	)
23436	(	61.8%	)

hasElevator [character]

1. no

2. yes

13664	(	36.0%	)
24277	(	64.0%	)

hasSecurity [character]

1. no

2. yes

32234	(	85.0%	)
5707	(	15.0%	)

hasStorageRoom [character]

1. no

2. yes

31784	(	83.8%	)
6157	(	16.2%	)

price [integer]

Mean (sd) : 3682.3 (2186)

min ≤ med ≤ max:

346 ≤ 3000 ≤ 23000

IQR (CV) : 1750 (0.6)

996 distinct values

month [character]

1. 2023-11-01
2. 2023-12-01
3. 2024-01-01
4. 2024-02-01
5. 2024-03-01
6. 2024-04-01
7. 2024-05-01
8. 2024-06-01

8533	(	22.5%	)
2790	(	7.4%	)
1865	(	4.9%	)
5278	(	13.9%	)
5160	(	13.6%	)
4953	(	13.1%	)
4641	(	12.2%	)
4721	(	12.4%	)

price_per_m2 [numeric]

Mean (sd) : 68.7 (22.6)

min ≤ med ≤ max:

11.3 ≤ 65.7 ≤ 189.5

IQR (CV) : 27.5 (0.3)

9600 distinct values

Generated by summarytools 1.0.1 (R version 4.4.1)
2025-02-10

Rental market data in general follow the same trend, however there are some differences:

Apartment buildings are the most common type of building the housing is located in;
Area of the apartments is a bit lower (average 53.4 m²) than in case of purchase market;
Two room apartments are the most represented category with over 57% of items;
The average of build year is higher for rental market (by 12 years), which is explained by higher attractiveness of modern buildings for tenants;
97.7% of the apartments is declared to be in premium condition.

Share of market by city with contingency tables

table <- table(data_buy$city) %>% 
  as.data.frame() 

table$Share <- round(prop.table(table$Freq), 4)*100 

arrange(table, desc(Share)) %>% 
  kable(col.names=c("City", "Quantity", "Share, %")) %>% 
  kable_styling()

City	Quantity	Share, %
warszawa	31982	34.40
krakow	14186	15.26
wroclaw	9598	10.32
gdansk	8244	8.87
lodz	7073	7.61
poznan	3639	3.91
bydgoszcz	3627	3.90
gdynia	3569	3.84
szczecin	2487	2.68
katowice	2397	2.58
lublin	2299	2.47
czestochowa	1242	1.34
bialystok	1017	1.09
radom	845	0.91
rzeszow	762	0.82

It follows from the data that the largest share of the apartments market is in Warsaw (above 30% of the country market), smaller parts of the market represent next major cities: Kraków (15.26%), Wrocaław (10.32%) and Gdańsk (8.87%).

table <- table(data_rent$city) %>% 
  as.data.frame() 

table$Share <- round(prop.table(table$Freq), 4)*100 

arrange(table, desc(Share)) %>% 
  kable(col.names=c("City", "Quantity", "Share, %")) %>% 
  kable_styling()

City	Quantity	Share, %
warszawa	14785	38.97
krakow	7366	19.41
wroclaw	3699	9.75
lodz	2647	6.98
poznan	2217	5.84
gdansk	1930	5.09
katowice	1453	3.83
gdynia	769	2.03
bydgoszcz	731	1.93
szczecin	719	1.90
lublin	632	1.67
czestochowa	345	0.91
rzeszow	310	0.82
bialystok	223	0.59
radom	115	0.30

Warsaw has even larger part of the rent market almost 40%, growth also noticeable in the market share of Krakow (+4%). Remaining cities have minor shares of this market (under 10% each).

Price distribution shape with tabular data

# range(data_buy$price)

limits <- seq(min(data_buy$price)-500000, max(data_buy$price)+500000, by=100000)

limits_factor <- cut(data_buy$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))

hist(data_buy$price, prob=TRUE, breaks=limits, main="Price groupping for purchase market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_buy$price), col=2)

print("TAI (tabular accuracy index)")

## [1] "TAI (tabular accuracy index)"

tab_accuracy <- classIntervals(data_buy$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)

##        # classes  Goodness of fit Tabular accuracy 
##       41.0000000        0.9947621        0.9148005

pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)

The price has skewed distribution distribution. A histogram was build to show how many housings fall in certain price levels from minimum price 150000 up to the maximum 3250000 by 100000 PLN. The histogram shows that most represented category is 600-700 thousands PLN. The partition accuracy was checked with the tabular accuracy index, which shows high tabular accuracy and goodness of fit.

# range(data_rent$price)
limits <- seq(min(data_rent$price) - 300, max(data_rent$price) + 300, by=500)

limits_factor <- cut(data_rent$price, limits)
table_price <- table(limits_factor, dnn = c("Limits"))
# transform(table_price, Percentage=prop.table(Freq)) #, Cum_Freq=cumsum(Freq))

hist(data_rent$price, prob=TRUE, breaks=limits, main="Price groupping for rental market", xlab="Price in PLN", col="lightblue", cex.lab=3, cex.main=3, cex.axis=3)
lines(density(data_rent$price), col=2)

print("TAI (tabular accuracy index)")

## [1] "TAI (tabular accuracy index)"

tab_accuracy <- classIntervals(data_rent$price, n=10, style="fixed", fixedBreaks=limits)
#tabular accuracy index
jenks.tests(tab_accuracy)

##        # classes  Goodness of fit Tabular accuracy 
##       46.0000000        0.9959035        0.9154236

pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3', 'red', 'yellow', 'brown', 'black')
colcode=findColours(tab_accuracy, pcol)
plot(tab_accuracy, pal=pcol, ylab='Cumulative %', cex.lab=3, cex.main=3, cex.axis=3)

The shapes of the rental market distribution is narrower than purchase market one, which can be explained by the target groups of these offers (often these group cannot afford to buy a housing).

pl1 <- data_buy %>%
  ggplot(aes(x=price_per_m2)) +
  geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
  geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
  annotate("text", x=12100, y=0.00004, label=paste("Median: ", as.character(median(round(data_buy$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
  ggtitle("Purchase market") + 
  labs(x="PLN per square meter", y="Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <- data_rent %>%
  ggplot(aes(x=price_per_m2)) +
  geom_density(color="#69b3a2", alpha=0.8, size=1.5) +
  geom_vline(aes(xintercept=median(price_per_m2)), linetype="dashed", size=0.8, color="69b3a2") +
  annotate("text", x=57, y=0.01, label=paste("Median: ", as.character(median(round(data_rent$price_per_m2, 2)))), angle=90, color="69b3a2", size=15) +
  ggtitle("Rental market") + 
  labs(x="PLN per square meter", y="Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 


grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price per square meter",gp=gpar(fontsize=35)))

This image shows density plots for the price per square meter in both the purchase and rental markets. The distribution graphs of prices per square meter are less skewed than normal prices and additionally:

Purchase Market
- The distribution appears right-skewed (positively skewed), meaning more properties have prices below the median, with some high-priced outliers.
- The median price per square meter is 13,322.98 PLN.
- There’s a peak around 10,000-15,000 PLN, suggesting a common price range.
Rental Market
- Also right-skewed, but with a sharper peak and a steeper decline.
- The median rental price per square meter is 65.67 PLN.
- Most rentals are concentrated in the 40-80 PLN per square meter range.

Housing type percentage in different cities

type_across_cities <- table(data_buy$city, data_buy$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>% 
  kable() %>% 
  kable_styling()

	apartmentBuilding	blockOfFlats	tenement	Sum
bialystok	11.50	85.25	3.24	99.99
bydgoszcz	8.77	65.95	25.28	100.00
czestochowa	7.33	85.35	7.33	100.01
gdansk	19.30	64.85	15.85	100.00
gdynia	20.62	67.39	11.99	100.00
katowice	19.40	56.86	23.74	100.00
krakow	20.05	65.68	14.27	100.00
lodz	15.07	69.70	15.23	100.00
lublin	3.48	87.12	9.40	100.00
poznan	14.87	64.91	20.23	100.01
radom	3.08	89.23	7.69	100.00
rzeszow	12.73	82.94	4.33	100.00
szczecin	8.04	62.08	29.88	100.00
warszawa	26.58	60.10	13.32	100.00
wroclaw	24.76	55.13	20.12	100.01

The table shows percentage of housing of the certain type per city. There are cities with higher percentage of apartment buildings (Gdansk, Gdynia, Krakow, Warszawa and Wroclaw), which are characterized by greater market share and higher prices, while in smaller cities with lower prices relative quantity of offers in blocks of flats is higher (Bialystok, Czestohchowa, Lublina, Radom). Nonetheless there is a common trend for each of the cities: the majority of housings is offered in blocks of flats, while apartments of buildings and tenements make up a smaller group of housings.

type_across_cities <- table(data_rent$city, data_rent$type, dnn = c("City", "Type"))
addmargins(round(prop.table(type_across_cities, 1) * 100, 2), 2) %>% 
  kable() %>% 
  kable_styling()

	apartmentBuilding	blockOfFlats	tenement	Sum
bialystok	26.01	70.40	3.59	100.00
bydgoszcz	21.07	57.73	21.20	100.00
czestochowa	16.52	72.17	11.30	99.99
gdansk	38.86	51.71	9.43	100.00
gdynia	26.92	61.51	11.57	100.00
katowice	39.99	48.80	11.22	100.01
krakow	41.35	46.17	12.48	100.00
lodz	45.86	45.67	8.46	99.99
lublin	19.30	67.25	13.45	100.00
poznan	31.75	56.52	11.73	100.00
radom	20.00	72.17	7.83	100.00
rzeszow	28.71	66.13	5.16	100.00
szczecin	27.96	59.67	12.38	100.01
warszawa	49.34	36.02	14.64	100.00
wroclaw	45.61	45.07	9.33	100.01

The situation is somewhat different for the rental market. There is an increase in apartment buildings type offers resulting in an incline in percentage of block of flats. Quantity of offers in tenements is also reduced insignificantly.

Descriptive characteristics of price per m² in purchase data set

type	mean	sd	median	min	max	skewness	kurtosis
apartmentBuilding	16835.35	4883.377	16614.36	4137.931	32096.77	0.3299874	2.672051
blockOfFlats	12820.99	4233.305	12531.38	3435.864	30935.48	0.5797487	3.329609
tenement	13848.69	5904.646	12983.91	3000.000	31041.67	0.4951448	2.487640

The most expensive in terms of price per square meter are housings in apartment buildings. Their average price is 26.8% more for blocks of flats and 25.1% for tenements in purchasing. The smallest standard deviation** value for blocks of flats in both markets can be explained by average characteristics of this type of housing: locations and conditions are close in different blocks of flats or apartments, while there are old, cheaper tenements and relatively small amount of the new ones in perfect condition and location so the deviation is significantly larger. Positive values of skewness indicate the tail of the distribution is on the right, which means the amount of more expensive apartments is fading more gradually than the cheaper ones. Kurtosis indicates quite peaked distributions for each category.

Descriptive characteristics of price per m² in rental data set

type	mean	sd	median	min	max	skewness	kurtosis
apartmentBuilding	75.76773	23.22751	72.37607	11.30396	187.8378	1.1240927	5.142059
blockOfFlats	61.56362	19.15648	59.25926	11.29861	189.4737	0.9502302	5.420271
tenement	70.01260	24.04812	67.30769	11.29971	189.3146	0.7050550	3.895923

Apartments in apartment buildings have the highest average rent (75.77 PLN/m²) and median (72.38 PLN/m²), suggesting they tend to be more expensive. Block of flats has the lowest average (61.56 PLN/m²) and median (59.26 PLN/m²), meaning they are generally more affordable, while tenements are in between as it was for purchasing data set. Standard deviation is highest for tenements (24.05 PLN/m²) and apartment buildings (23.23 PLN/m²), indicating a wider spread in rental prices. The maximum rental price is quite high for all categories (~189 PLN/m²), suggesting some luxury or premium properties significantly impact the upper range. Apartment buildings (5.14) and block of flats (5.42) have higher kurtosis, indicating a sharper peak with more extreme values. Tenements (3.90) have a more moderate kurtosis, meaning a more balanced distribution without extreme outliers.

mu_buy <- ddply(data_buy, "type", summarise, grp.mean=mean(price_per_m2))

pl1 <- ggplot(data_buy, aes(x=price_per_m2, color=type)) +
  geom_density(size=1.5) +
  theme(legend.position="top") +
  geom_vline(data=mu_buy, aes(xintercept=grp.mean, color=type), linetype="dashed") +    
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) +
  theme_classic() +
  labs(x="PLN per square meter", y="Density", subtitle  = "Purchase market") +
  theme() +
  theme(text = element_text(size = 25)) 

mu_rent <- ddply(data_rent, "type", summarise, grp.mean=mean(price_per_m2))

pl2 <- ggplot(data_rent, aes(x=price_per_m2, color=type)) +
  geom_density(size=1.5) +
  geom_vline(data=mu_rent, aes(xintercept=grp.mean, color=type), linetype="dashed") +   
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9")) + 
  theme_classic() +
  labs(x="PLN per square meter", y="Density", subtitle  = "Rental market") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Distribution of price per square meter by housing type",gp=gpar(fontsize=35))
)

This image presents density plots for the distribution of price per square meter in both the purchase and rental markets, categorized by housing type (apartment buildings, blocks of flats and tenements). Purchase prices are more spread out than rental prices, particularly for apartment buildings. Rental prices are relatively more concentrated, with fewer extreme values. Blocks of flats consistently show the lowest prices and least variation, making them the most affordable option. Apartment buildings are the most expensive, both in purchase and rental markets, with more high-priced outliers.

pl1 <- ggplot(data_buy, aes(x=type,  fill=buildingMaterial))+
  geom_bar(stat="count", width=0.95) + 
  scale_fill_manual(values=c("darkred", "grey")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

pl2 <-ggplot(data_rent, aes(x=type,  fill=buildingMaterial))+
  geom_bar(stat="count", width=0.95) + 
  scale_fill_manual(values=c("darkred", "grey")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 25)) 

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Housing type vs building material",gp=gpar(fontsize=35))
)

The majority of housings is build of brick, while this material is used more frequently in apartment buildings and tenements than in blocks of flats, which are made of concrete slab in one third of cases. The trend is common for purchasing and rental markets.

Statistical tests

Statistical tests are conducted to determine whether observed data patterns are due to chance or represent real effects. They help in making objective decisions, validating hypotheses, and drawing conclusions in research by measuring relationships, differences, or associations within data sets.

Statistical significance of price median difference between cities

The main purpose of the statistical test we are going to perform is to determine which variables have a statistically significant impact on both rent and buy prices.

palette_colors <- viridis(length(unique(complete_data_buy$city)))

complete_data_buy %>% 
  ggbetweenstats(
    y = price,
    x = city,
    bf.message = FALSE,
    pairwise.display = "none",
    title = "Distribution of buy price across cities - buy dataset",
    palette = "default_ucscgb",
    package = "ggsci",
    centrality.label.args = list(size  = 7)
  ) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))

complete_data_rent %>% 
  ggbetweenstats(
    y=price,
    x=city,
    bf.message = FALSE,
    pairwise.display = "none",
    title = "Distribution of rent price across cities - rent dataset",
    palette = "default_ucscgb",
    package = "ggsci",
    centrality.label.args = list(size  = 7)
  ) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))

Based on Welch’s test, we obtained a p-value close to 0, so we reject the null hypothesis of equal means between populations in both the buy and rent data sets. This means that the city statistically significantly differentiates both rent and buy prices. However, an effect size of 0.86 implies that this differentiation has a moderate magnitude.

Statistical significance differentiation of price by amenities

amentities <- c("hasBalcony", "hasElevator", "hasParkingSpace", "hasStorageRoom", "hasSecurity")
lst <- list()
for (i in amentities) {
  lst[[i]] <-  
      ggbetweenstats(complete_data_buy,
        y = price,
        x = !!sym(i),
        bf.message = FALSE,
        palette = "default_ucscgb",
        package = "ggsci",
        title = i,
        centrality.label.args = list(size  = 7),
        ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
      ) +
      theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  +
      xlab("")
}

grid.arrange(grobs=lst,
    ncol = 3,
    nraw = 2,
    top =textGrob("Buy Price Distribution by amenities",gp=gpar(fontsize=35))
)

All amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.14 for a balcony to -0.57 for an elevator.

for (i in amentities) {
  lst[[i]] <-  
      ggbetweenstats(complete_data_buy,
        y = price,
        x = !!sym(i),
        bf.message = FALSE,
        palette = "default_ucscgb",
        package = "ggsci",
        title = i,
        centrality.label.args = list(size  = 7),
        ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
      ) +
      theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  +
      xlab("")
}

grid.arrange(grobs=lst,
    ncol = 3,
    nraw = 2,
    top =textGrob("Rent Price Distribution by amenities",gp=gpar(fontsize=35))
)

Similarly to buy prices, all amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.10 for a balcony to -0.31 for an elevator.

pl1 <-  ggbetweenstats(complete_data_buy,
    y = price,
    x = condition,
    bf.message = FALSE,
    palette = "default_ucscgb",
    package = "ggsci",
    title = "Buy data set",
    centrality.label.args = list(size  = 10),
    ggplot.component = list(theme(plot.subtitle = element_text(size = 10, face = "bold")))
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25))  

pl2 <-  ggbetweenstats(complete_data_rent,
    y = price,
    x = condition,
    bf.message = FALSE,
    palette = "default_ucscgb",
    package = "ggsci",
    title = "Rental data set",
    centrality.label.args = list(size  = 10),
    ggplot.component = list(theme(plot.subtitle = element_text(size = 12, face = "bold")))
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 35), text = element_text(size = 25)) +
  ylab("")

grid.arrange(pl1+theme(legend.position="top"),
    pl2+theme(legend.position="top"),
    ncol = 2,
    top =textGrob("Price difference by condition",gp=gpar(fontsize=35))
)

Finally, condition of flat in both rent and buy data sets also seem to differentiate price significantly but has only moderate magnitude.

Conclusions

This project provides a comprehensive analysis of the real estate market, utilizing statistical methods and visualizations to uncover key trends and insights. The study examines data structure, performs data cleansing, and addresses outliers to ensure accuracy. Through descriptive analysis and summary statistics, the project identifies variations in market share across cities and explores price distribution patterns.

The following problems were explored among others:

A data-driven overview of real estate pricing trends.
Identification of key factors influencing property prices.
Insights into how housing types and amenities impact pricing.
Statistical validation of differences in pricing across various cities.

Findings and insights

Certain cities (e.g., Gdańsk, Gdynia, Częstochowa) show significant missing data in condition and build year attributes, however there is no strong correlation between missing values and other features. The highest property prices are in Warsaw, Krakow, and Gdansk, while the cheapest cities are Czestochowa, Radom, and Bydgoszcz. Property prices follow a fat-tailed distribution, while high-priced properties are concentrated in city centers with dense places of interest (PoI). No strong correlation is found between price levels and proximity to public places, but distances between amenities show a strong relationship.

Rental and purchase price distributions by cities, amenities and condition differ significantly, as confirmed by Welch’s test (p-value ≈ 0). Rental prices are more uniform, except in Warsaw, where the median rent is 30% higher than in other cities. A positive correlation exists between apartment size and price, with larger price increases in cities with higher property costs. Amenities generally increase housing prices, except for storage rooms, which correlate with lower prices, but the effect size is small (e.g., -0.14 for balconies, -0.57 for elevators). Housing condition has a moderate impact on prices, but its effect size is limited.

The study confirms that city location, property size, and amenities significantly influence real estate prices. While rental prices show more uniformity, purchase prices exhibit stronger variations based on property type and city. Warsaw dominates both markets, with modern buildings losing value over time. Statistical tests validate these findings, providing insights for investors, buyers, and policymakers.

Real estate market analysis

Nazar Ostrowski, Norbert Gała, Stanisław Kochnowski

2025-02-10

Introduction

Objectives

Data structure

Files

Data fields

Data Cleansing and Wrangling

Removing unnecessary data

Data summary

Checking NAs

Data set ‘buy’

Data set ‘rent’

Outliers

Outliers visualization for the ‘buy’ data set

Outliers visualization for the ‘rent’ data set

Taking a closer look at the rent outliers

Data validation

Data imputation

Visualizations

The impact of amenities on price

Purchase data set

Buy data set

Buy data set

Rental data set

Correlation exploration

Purchase data set

Rental data set

Descriptive analysis

Summary statistics

Price distribution shape with tabular data

Housing type percentage in different cities

Descriptive characteristics of price per m² in purchase data set

Descriptive characteristics of price per m² in rental data set

Statistical tests

Statistical significance of price median difference between cities

Statistical significance differentiation of price by amenities

Conclusions

Findings and insights