Introduction

Through the centuries people have invented unlimited variety of techniques and tools to invest money. Real-estate market is probably one of the oldest. It guarantees, seemingly, a perfect trade-off between cash, and something which allows to convert an abstract concept of money into more tangible thing – hectares of rural plots, houses etc. However, despite the lapse of time and hundreds of years the market exists in different forms, there are no perfect answers on what drives the prices of specific types of properties. Sometimes one can get an impression that the attempts to predict the housing prices, which are widely described around different media, are closer to some kind of guessing than any scientific approach. When the coronavirus pandemic in Poland and the lockdown related to it were in peak, one could read the interviews with real-estate experts who were claiming that the prices are not going to go down. For a relatively economic-conscious person such statements might be confusing. How can it be that all the markets are most likely to enter a recession and, at the same time, the prices of properties would increase? There is no easy answer to that question. Sometimes, the experts might be wrong on purpose, driven by their own speculative interest. However, what is clear is that there are many determinants of property prices, having their roots in different fields. Starting from the characteristics of properties, going through the situation on financial markets and ending with their influence on investing behavior of people. In this work characteristics of properties are a subject of analysis.

The main objective of this study is to find determinants of prices of apartments in Warsaw. The relation between characteristics of specific property and its value on the online-market is analyzed. The data obtained from otodom.pl, which is one of the largest real-estate platforms in Poland, is combined with spatial and district characteristics. In this analysis multiple machine-learning methods, starting from OLS regression, through spatial-econometrics specific methods and ending with non-parametric model, are utilized. There are two main hypotheses verified in this work.

Hypothesis One
Spatial-based variables are significant predictors of housing-prices


Hypothesis Two
The coronavirus pandemic had a negative impact on the housing-prices

1. Data

The dataset used in the analysis is created by the authors based on several sources. There are three general groups of variables:

Property characteristics were obtained from otodom.pl website with the use of web scraping algorithm written in Python. All available features of apartments were collected from adverts, amid which address information is the key one from the perspective of this study. Spatial data includes distances from collected apartments to a few, arbitrary selected, points on a map of Warsaw. To measure the distances it was neccessary to employ publicly available APIs:

District characteristics are captured with the measures able to reflect attractiveness of living conditions in particular Warsaw districts. The need to control for such variables was determined by literature overview, which proves that life-quality matters in the case of housing-prices.3 The life-quality measures were assessed from the report prepared by the Statistical Office in Warsaw.4

Table 1 presents all the variables analyzed with its detailed descriptions.

df_final <- readxl::read_excel("df_final_v4.xlsx")
df_final$...1 <- NULL

Table 1. Variables and their description

Variable Name Variable description
District The district
Total_price Total price of the property in PLN
Days_from_initial_announcement Days from initial announcement appeared on the website
Rent Rent
Ownership Form of ownership
Floor Floor
Construction_type Type of construction
Finishing_condition Finishing condition
No_of_floors Number of floors in the building
Property_market Property market (if primary or secondary)
Area Living area of the property in squared metres
No_of_rooms Number of rooms
Year_built Construction completion year
Heating Heating type
Lat Latitude of the property
Lon Longitude of the property
Distance_to_PKIN_in_km Distance to the center (indicated by the PKiN building) in km
Time_to_PKIN_in_minutes Time to the center (indicated by the PKiN building) in minutes
Distance_to_subway_in_km Distance to the nearest subway in km
Distance_to_airport_in_km Distance to Chopin Airport in km
Price_per_sqm Price per square metre
Population_density Population density of district
Vistula_side(1_for_west_0_for_east) Which side of the Vistula the property is located (1=West, 0=East)
Life_quality Life quality within a district
Family_friendliness Family friendliness within a district
Single_people_friendliness Single people friendliness within a district
Senior_people_friendliness Senior people friendliness within a district

1.1. Data preparation

Raw data preparation and manipulation are presented in this section. There are three main areas that it covers: proper variables formatting, variables categorization and missing values treatment.

1.1.1. Variables formatting

  • Initial formatting of the variables was investigated.
nums <- sapply(df_final, is.numeric)
df_final_nums <- df_final[,nums]
kable(head(df_final_nums)) %>%
  kable_styling(bootstrap_options = c("hover", "responsive")) %>%
  scroll_box(width = "100%", height = "90%")
Total_price Rent No_of_floors Area Year_built Lat Lon Distance_to_PKIN_in_km Time_to_PKIN_in_minutes Price_per_sqm Population_density Vistula_side Life_quality Family_friendliness Single_people_friendliness Senior_people_friendliness Distance_to_subway_in_km Distance_to_airport_in_km
365065 NA 5 50.34 2020 52.34123 20.94093 16.244900 22.551667 7251.986 1699 0 0.324 0.417 0.148 0.230 5.7014217 18.090115
450000 NA 3 46.70 1989 52.25344 20.91172 8.450500 14.270000 9635.974 4967 1 0.352 0.307 0.250 0.197 3.7484985 9.561767
420000 NA 17 34.00 1967 52.23196 21.00672 5.742090 8.734317 12352.941 6174 1 0.579 0.561 0.740 0.529 0.3477333 9.887956
530000 NA 3 46.60 1964 52.26685 20.97277 5.170000 7.865000 11373.391 6174 1 0.579 0.561 0.740 0.529 0.5360078 10.920845
735000 530 7 59.00 2019 52.23196 21.00672 4.316745 7.579726 12457.627 7319 1 0.462 0.409 0.560 0.526 0.3477333 9.887956
630070 NA 7 60.22 2020 52.17551 20.99791 7.448400 10.970000 10462.803 6146 1 0.592 0.469 0.491 0.613 2.4176840 6.323684


  • For variables which are qualitative or binary, factor formatting is applied.
oth <- which(!nums)
df_final_oth <- df_final[,oth]
kable(head(df_final_oth)) %>%
  kable_styling(bootstrap_options = c("hover", "responsive")) %>%
  scroll_box(width = "100%", height = "90%")
District Days_from_initial_announcement Ownership Floor Construction_type Finishing_condition Property_market No_of_rooms Heating
Białołęka 14 NA 1 NA NA pierwotny 2 NA
Bemowo 9 pełna własność 2 blok NA wtórny 3 miejskie
Żoliborz Less than a day NA > 10 NA NA wtórny 2 NA
Żoliborz 16 NA 3 blok do zamieszkania wtórny 2 miejskie
Wola 14 pełna własność 6 blok do wykończenia pierwotny 3 NA
Mokotów 13 NA 4 blok do wykończenia wtórny 3 miejskie

df_final$District <- as.factor(df_final$District)
df_final$Ownership <- as.factor(df_final$Ownership)
df_final$Construction_type <- as.factor(df_final$Construction_type)
df_final$Finishing_condition <- as.factor(df_final$Finishing_condition)
df_final$Property_market <- as.factor(df_final$Property_market)
df_final$Heating <- as.factor(df_final$Heating)
df_final$Vistula_side <- as.factor(df_final$Vistula_side)

1.1.2. Features engineering

In the next stage, categorization of numerical variables takes place. The variable cuts are imposed based on:

  • A relationship between the dependent variable, being Price_per_sqm, and variable of interest
  • Splits obtained with the decision tree algorithm.

Number of floors

Figure 1. Number of floors vs Price per sqm

ggplot(df_final, aes(x = No_of_floors, y = Price_per_sqm)) + 
  geom_point(color="#FDE733") +
  theme_minimal()

Figure 2. Levels of variable Number of floors suggested by Decision Tree algorithm

formula <- Price_per_sqm ~ No_of_floors
tree1 <- rpart(formula, data = df_final, control=rpart.control(minsplit=10, cp=0.002, maxdepth = 3))
fancyRpartPlot(tree1, palettes = "Spectral")

df_final$No_of_floors_binned <- 
  ifelse(df_final$No_of_floors < 2, "<= 2", 
       ifelse(df_final$No_of_floors > 2 & df_final$No_of_floors <= 4, "> 2 & <= 4",
              ifelse(df_final$No_of_floors > 4 & df_final$No_of_floors <= 28, "> 4 & <= 28", 
                     ifelse(df_final$No_of_floors > 28, "> 28", 0))))

df_final$No_of_floors_binned <- as.factor(df_final$No_of_floors_binned)

Construction completion year

When analyzing variable indicating construction completion year, three outliers were noticed which seemed to be caused by mistyping and corrected. Additionally, decision has been made to create and use in the further modelling a variable indicating age of the building instead of the year it was built.

df_final$Year_built <- ifelse(df_final$Year_built==19885, 1985, df_final$Year_built)
df_final$Year_built <- ifelse(df_final$Year_built==20112, 2012, df_final$Year_built)
df_final$Year_built <- ifelse(df_final$Year_built==20110, 2010, df_final$Year_built)

df_final$Building_age <- 2020 - df_final$Year_built

Figure 3. Age of the apartment vs Price per sqm

ggplot(df_final, aes(x = Building_age, y = Price_per_sqm)) + 
  geom_point(color="#FDE733") +
  theme_minimal()

Figure 4. Levels of variable Age of apartment suggested by Decision Tree algorithm

formula <- Price_per_sqm ~ Building_age
tree2 <- rpart(formula, data = df_final, control=rpart.control(minsplit=10, cp=0.002, maxdepth = 3))
fancyRpartPlot(tree2, palettes = "Spectral")

df_final$Building_age_binned <- 
  ifelse(df_final$Building_age <= 0, "<= 0", 
       ifelse(df_final$Building_age > 0 & df_final$Building_age <= 12, "> 0 & <= 12",
              ifelse(df_final$Building_age > 12 & df_final$Building_age <= 64, "> 12 & <= 64", 
                     ifelse(df_final$Building_age > 64 & df_final$Building_age <= 105, "> 64 & <= 105",
                            ifelse(df_final$Building_age > 105, "> 105", 0)))))

df_final$Building_age_binned <- as.factor(df_final$Building_age_binned)

Days from the initial announcement

df_final$Days_from_initial_announcement <- ifelse(df_final$Days_from_initial_announcement=="Less than a day", 0, df_final$Days_from_initial_announcement)
df_final$Days_from_initial_announcement <- as.integer(df_final$Days_from_initial_announcement)

Figure 5. Days from initial announcement vs Price per sqm

ggplot(df_final, aes(x = Days_from_initial_announcement, y = Price_per_sqm)) + 
  geom_point(color="#FDE733") +
  theme_minimal()

Figure 6. Levels of variable Days from initial announcement of apartment suggested by Decision Tree algorithm

df_final$Days_from_initial_announcement_binned <- 
  ifelse(df_final$Days_from_initial_announcement == 0, "Less than a day", 
       ifelse(df_final$Days_from_initial_announcement > 0 & df_final$Days_from_initial_announcement <= 7, "Week or less than a week",
              ifelse(df_final$Days_from_initial_announcement > 7 & df_final$Days_from_initial_announcement <= 30, "Month or less than a month", 
                     ifelse(df_final$Days_from_initial_announcement > 30 & df_final$Days_from_initial_announcement <= 365, "Year or less than a year",
                            ifelse(df_final$Days_from_initial_announcement > 365, "Over a year", 0)))))

df_final$Days_from_initial_announcement_binned <- as.factor(df_final$Days_from_initial_announcement_binned)

Floor

df_final$Floor <- as.factor(df_final$Floor)
levels(df_final$Floor)

Figure 7. Count of specific levels of Floor of apartment

df_final %>% 
  group_by(Floor) %>% 
  count %>% 
  ggplot(aes(x = Floor, y = n)) + 
  geom_bar(stat = "identity", fill = "orange") +
  theme_minimal()

df_final$Floor_binned <- df_final$Floor
levels(df_final$Floor_binned) <- c("> 10","=> 1 & <= 5","> 5 & <= 10","=> 1 & <= 5","=> 1 & <= 5","=> 1 & <= 5","=> 1 & <= 5","> 5 & <= 10","> 5 & <= 10","> 5 & <= 10","> 5 & <= 10","parter","poddasze","suterena")
levels(df_final$Floor_binned)

Number of rooms

df_final$No_of_rooms <- as.factor(df_final$No_of_rooms)
levels(df_final$No_of_rooms)

Figure 8. Count of specific levels of Number of rooms in apartment

df_final %>% 
  group_by(No_of_rooms) %>% 
  count %>% 
  ggplot(aes(x = No_of_rooms, y = n)) + 
  geom_bar(stat='identity', fill = "orange") +
  theme_minimal()

df_final$No_of_rooms_binned <- df_final$No_of_rooms
levels(df_final$No_of_rooms_binned) <- c("1","2","3 or more","3 or more","3 or more","3 or more","3 or more","3 or more","3 or more","3 or more")
levels(df_final$No_of_rooms_binned)

1.1.3. Treatment of missing values

This section addresses missing data treatment. In Table 2 variables with missing values are presented along with the missing values ratio.

Table 2. Missing values

missings <- as.data.frame(colMeans(is.na(df_final))) 
missings["Variabe"] <- row.names(missings)
colnames(missings) <- c("Missings_ratio", "Variable")
missings %>% 
  dplyr::filter(Missings_ratio > 0.001) %>% 
  dplyr::select(Variable, Missings_ratio) %>%
  kable() %>% 
  kable_styling()
Variable Missings_ratio
Rent 0.5520198
Ownership 0.2860833
Floor 0.0290319
Construction_type 0.2051865
Finishing_condition 0.3057134
No_of_floors 0.0664325
Year_built 0.1516686
Heating 0.3544788
No_of_floors_binned 0.0664325
Building_age 0.1516686
Building_age_binned 0.1516686
Floor_binned 0.0290319

For qualitative variables like: form of ownership, floor, type of construction, finishing condition, heating, number of floors in the building, building age and floor, the decision has been made to make missing values an explicit factor level.

f_names <- c("Ownership", "Floor", "Construction_type", "Finishing_condition", "Heating", "No_of_floors_binned", "Building_age_binned", "Floor_binned")

df_final[,f_names] <- lapply(df_final[,f_names], function(x) fct_explicit_na(x, na_level = "Missing")) %>% as.data.frame

As to qualitative variables, the decision to fill the value of rent with predictions obtained with Mean Matching algorithm has been made. Value of rent depends on many external factors such as type of the entity managing the building, living area, number of people living in the property, type of heating (e.g. if urban then it is contained in the rent, if gas then it is paid separately to the gas plant). The decision has been made to use as a proxy of these factors the living area, number of rooms and also type of heating.

formula <- make.formulas(df_final[,c("Rent","Area","No_of_rooms_binned","Heating")])
imp <- mice(df_final[,c("Rent","Area","No_of_rooms_binned","Heating")],
            m=5,
            method="pmm",
            formulas = formula,
            seed=1234)

rent_imp <- rowMeans(imp[["imp"]][["Rent"]])
names(rent_imp) <- NULL

df_final$id <- 1:dim(df_final)[1]
miss_rent <- df_final %>% filter(is.na(Rent))
nonmiss_rent <- df_final %>% filter(!is.na(Rent))
nonmiss_rent$rent_imp <- NA

df_final_1 <- cbind(miss_rent, rent_imp)
df_final_2 <- rbind(nonmiss_rent, df_final_1)

df_final_2$Rent_imp <- ifelse(is.na(df_final_2$Rent), rent_imp, df_final_2$Rent)
df_final_2 <- df_final_2 %>% dplyr::select(id, Rent_imp)
df_final <- left_join(df_final, df_final_2, by="id") %>% dplyr::select(-id)

2. Exploratory Data Analysis

After the final data has been obtained there was a possibility to look at relations between selected variables and the districts of Warsaw. Figure 9 presents mean price per square metre of apartments in specific districts

Figure 9. Mean price per square metre in specific districts of Warsaw

Figure 1 

Rembertów, which is an east-border district of Warsaw occured to be the cheapest one place to live in the city in mean values. At the same time, Śródmieście, being the central region, has the highest mean prices, almost 3 times as high as in Rembertów.

Figure 10. Max price per square metre in specific districts of Warsaw

Figure 2 

The same pattern as in the case of mean prices, can be spotted for max prices. Again, Rembertów is clearly the cheapest one district and Śródmieście is the most expensive. The relation between two extreme districts is even more visible. Max prices in Śródmieście are about 5 times higher than in Rembertów.

From the analysis of mean and maximum prices a picture of Warsaw apartments market can be drawn. The prices, in general, increase to the central part of the city. Additionaly, the market is quite differentieted. One can also get an impression that the apartments located on the east side of the Vistula river are cheaper. Therefore, Auxiliary Hypothesis One is proposed.

Auxiliary Hypothesis One
The prices of the apartments located on the east side of Vistula river are, in general, cheaper


Figure 11. Mean apartment square metre area in specific districts of Warsaw

Figure 3

Mean apartment square area is a variable which delivers an interesting view on a differentation of square metres of apartments in districts. The south districts seem to have higher means of square metres of apartments.

Figure 12. Median year of construction of the apartments in specific districts of Warsaw

Figure 4 

Median year of construction can show how Warsaw evolved in time. The process, without suprise, escalated from the center of city - Śródmieście with apartments older than 50 years (with median threshold) to the borders of Warsaw, but not with the same speed.

Figure 13. Life quality measures in specific districts of Warsaw

Figure 5 

Life quality measures were introduced to capture the determinants of housing-prices widely present in the literature. Here, again we can spot the relation, which was identified for prices of square metres of apartments. Rembertów is the worst district in the case of presented measures and Śródmieście is on the top. Based on this observation, Auxiliary Hypothesis Two is introduced.

Auxiliary Hypothesis Two
There is a siginficant relation between prices of apartments and life quality measures in specific districts Warsaw


3. Modelling

Due to the nature of main hypotheses stated in Introduction, and Auxiliary hypotheses from Section 2, the modelling part is done separately for each hypothesis. Different modelling strategies are applied for a purpose of verification. The problem captured with the Hypothesis One is addressed with multiple modelling approaches, including OLS, spatial methods, and non-parametric algorithm. For the rest of hypotheses, their, relatively simple functional form, makes the OLS regression enough to conduct a meaningful verification process. In each subsection, after the modelling, the results of it are discussed and hypothesis is definitely verified.

3.1. Hypothesis One

Hypothesis One
Spatial-based variables are significant predictors of housing prices

In order to estimate housing prices, several models have been run. There are three types of models which are considered: regular OLS regression, spatial algorithm and non-parametric algorithm.

3.1.1. OLS Regression

First model run is linear regression with only spatial characteristics as independent variables. Stepwise regression in order to obtain the best model formula according to AIC criterion is employed.

ols_0 <- lm(Price_per_sqm ~ Distance_to_PKIN_in_km + Distance_to_subway_in_km + Distance_to_airport_in_km,
          data=df_final)

The adjusted R-squared of the model is 31.4%. All variables occurred to be significant according to p-value statistic. The model shows the negative relationship of the price per squared metre with distance to PKiN and distance to subway and positive relationship with distance to airport. The results are intuitive - the prices of properties are indeed higher in the centre and near subway station as it increases the property functionality, while prices of properties near airports are lower due to aircraft noise.

Table 3. Results of OLS regression for Hypothesis One only with spatial features included

step.model_0 <- step(ols_0, direction = "both", trace = FALSE)
step.model_0 %>% 
  tab_model()
  Price_per_sqm
Predictors Estimates CI p
(Intercept) 14880.38 14688.99 – 15071.77 <0.001
Distance_to_PKIN_in_km -546.64 -565.60 – -527.67 <0.001
Distance_to_subway_in_km -61.88 -90.02 – -33.73 <0.001
Distance_to_airport_in_km 110.47 92.58 – 128.36 <0.001
Observations 9679
R2 / R2 adjusted 0.315 / 0.314

To enrich the analysis additionally a model which in addition to spatial characteristics considers also chosen property characteristics has been run.

ols_1 <- lm(Price_per_sqm ~ 
              District + Rent_imp + Building_age_binned + Property_market +
              No_of_rooms + Distance_to_PKIN_in_km + Distance_to_subway_in_km +
              Distance_to_airport_in_km,
            data=df_final)

With the use of stepwise regression the final model formula was selected by the algorithm (variable Rent_imp was not chosen by the algorithm). It is clear that District is a significant factor – Białołęka, Rembertów, Wawer or Wesoła have lower property prices than Bemowo district being the base factor level. However in case of those districts closer to the center – Śródmieście, Ochota, Mokotów or, considered as family friendly – Ursynów, Wilanów or Bielany, the price is higher than in Bemowo district. It is intuitive and the model coefficients confirm that phenomenon. As to building age the base level are properties not yet built and thus it can be interpreted based on the model output that buildings built this year or 12 years old tend to have higher prices than those not yet built. Opposite relation can be seen in case of older buildings in case of which the price is lower than those not yet built. As to property market, properties from the secondary market tend to exhibit higher prices per squared metre than these from the primary market. As to number of rooms, the model indicated that for significant levels price for squared metre is always lower for apartment with more than one room. The influence of spatial variables is the same as in case of the first model.

Table 4. Results of OLS regression for Hypothesis One with all chosen features

step.model_1 <- step(ols_1, direction = "both", trace = FALSE)
step.model_1 %>% 
  tab_model()
  Price_per_sqm
Predictors Estimates CI p
(Intercept) 11957.44 11507.18 – 12407.70 <0.001
District [Białołęka] -968.55 -1327.50 – -609.61 <0.001
District [Bielany] 611.16 248.07 – 974.24 0.001
District [Mokotów] 1631.58 1344.93 – 1918.22 <0.001
District [Ochota] 1566.31 1162.05 – 1970.56 <0.001
District [Praga-Południe] -6.78 -342.39 – 328.83 0.968
District [Praga-Północ] 279.82 -140.78 – 700.42 0.192
District [Rembertów] -1879.77 -2581.01 – -1178.53 <0.001
District [Śródmieście] 5339.80 4968.31 – 5711.30 <0.001
District [Targówek] -488.48 -902.88 – -74.07 0.021
District [Ursus] -148.07 -510.32 – 214.18 0.423
District [Ursynów] 1372.91 1025.90 – 1719.92 <0.001
District [Wawer] -1106.54 -1604.72 – -608.36 <0.001
District [Wesoła] -1061.95 -1924.36 – -199.54 0.016
District [Wilanów] 1223.24 849.63 – 1596.85 <0.001
District [Włochy] -429.48 -861.08 – 2.12 0.051
District [Wola] 1118.02 791.92 – 1444.13 <0.001
District [Żoliborz] 1701.15 1293.71 – 2108.58 <0.001
Building_age_binned [> 0
& <= 12]
1493.58 1290.35 – 1696.82 <0.001
Building_age_binned [>
105]
462.06 -48.03 – 972.15 0.076
Building_age_binned [> 12
& <= 64]
-975.15 -1188.53 – -761.76 <0.001
Building_age_binned [> 64
& <= 105]
-527.11 -811.22 – -243.00 <0.001
Building_age_binned
[Missing]
-24.21 -225.10 – 176.68 0.813
Property_market [wtórny] 616.96 438.86 – 795.06 <0.001
No_of_rooms [2] -589.01 -778.42 – -399.61 <0.001
No_of_rooms [3] -1021.50 -1212.07 – -830.93 <0.001
No_of_rooms [4] -926.42 -1148.06 – -704.79 <0.001
No_of_rooms [5] -826.91 -1177.80 – -476.02 <0.001
No_of_rooms [6] -948.34 -1613.17 – -283.52 0.005
No_of_rooms [7] -1414.62 -2636.45 – -192.79 0.023
No_of_rooms [8] -468.12 -4190.95 – 3254.71 0.805
No_of_rooms [9] 237.03 -2118.30 – 2592.37 0.844
No_of_rooms [więcej niż
10]
-4764.89 -8538.81 – -990.97 0.013
Distance_to_PKIN_in_km -248.98 -284.26 – -213.69 <0.001
Distance_to_subway_in_km -58.80 -91.46 – -26.14 <0.001
Distance_to_airport_in_km 83.04 61.27 – 104.81 <0.001
Observations 9679
R2 / R2 adjusted 0.494 / 0.492

3.1.2. Spatial models

Another type of model which is considered is spatial model. The data analyzed is point data. Therefore, there was a need to somehow aggregate it. Instead of analyzing particular districts (only 18 observations), the decision has been made to divide Warsaw map into grid with cells of 1 km x 1 km and then aggregate the data within grid cells.

# loading shapefile with Warsaw districts polygons
districts <- readOGR("./shapefile", "dzielnice_Warszawy") 
districts <- spTransform(districts, CRS("+proj=longlat +datum=NAD83"))

# reading the grid for Poland and changing projections; limiting variables to only those indicating grids; 
# it takes a lot of time so we decided to save the key results of the code to RData object and just read the final limited objects
# pop <- readOGR("./PD_STAT_GRID_CELL_2011.shp", "PD_STAT_GRID_CELL_2011")
# pop <- spTransform(pop, CRS("+proj=longlat +datum=NAD83"))
# pop.df <- as.data.frame(pop) %>% dplyr::select(SHAPE_Leng, SHAPE_Area, CODE)

# separating grid
# pop.grid <- as(pop, "SpatialPolygons")

# saveRDS(pop.df, file = "pop.df.RDS") 
# saveRDS(pop.grid, file = "pop.grid.RDS") 

pop.df <- readRDS("pop.df.RDS") 
pop.grid <- readRDS("pop.grid.RDS") 

After the process of reading and transforming the data on grids, it was possible to limit grid data for whole Poland to only Warsaw. The result of the operation were 601 grids which can be seen in the Figure 14 compared to the plain districts visualization.

Figure 14. Warsaw Shapefile – grid representation

lim <- sp::over(pop.grid, districts) # overlay of grid and contour map

pop.df.warsaw <- pop.df[!is.na(lim$nazwa_dzie), ]
pop.grid.warsaw <- pop.grid[!is.na(lim$nazwa_dzie), ]
# length(pop.grid.warsaw) 601

par(mfrow=c(1,2))
plot(districts)
plot(pop.grid.warsaw)

As a next step there was a need to transform data on property characteristics to spatial object in order to merge it with the grid data. Limited number of features has been arbitrary selected for further modelling. These variables are rent, building age, distance to PKiN, distance to a nearest subway station and distance to the airport. Dependent variable was price of the property per squared metre. Average property price per squared metre per grid is presented in Figure 15.

Figure 15. Average property price per squared metre per grid in Warsaw

# property data transformation
dane.sp <- df_final
coordinates(dane.sp) <- c("Lon","Lat") 
proj4string(dane.sp) <- CRS("+proj=longlat +datum=NAD83")
pop.grid.warsaw <- spTransform(pop.grid.warsaw, CRS("+proj=longlat +datum=NAD83"))
dane.sp <- spTransform(dane.sp, CRS("+proj=longlat +datum=NAD83"))

# assigning points to grid
locs.lim <- over(dane.sp, pop.grid.warsaw)
df_final$grid <- locs.lim  

# preparing the objects of ordered data 
aaa1 <- lapply(pop.grid.warsaw@polygons, slot, "ID") # assigning slots to ID
aaa2 <- unlist(lapply(pop.grid.warsaw@polygons, slot, "ID")) # list of slots numbers
pop.df.warsaw$ID <- 1:601

# aggregating data by grid
# dependent variable
Price_per_sqm.agg <- aggregate(df_final$Price_per_sqm, by=list(df_final$grid), mean, na.rm=TRUE)
# potential independent variables
Rent.agg <- aggregate(df_final$Rent_imp, by=list(df_final$grid), mean, na.rm=TRUE)
Building_age.agg <- aggregate(df_final$Building_age, by=list(df_final$grid), mean, na.rm=TRUE)
Distance_to_PKIN_in_km.agg <- aggregate(df_final$Distance_to_PKIN_in_km, by=list(df_final$grid), mean, na.rm=TRUE)
Distance_to_subway_in_km.agg <- aggregate(df_final$Distance_to_subway_in_km, by=list(df_final$grid), mean, na.rm=TRUE)
Distance_to_airport_in_km.agg <- aggregate(df_final$Distance_to_airport_in_km, by=list(df_final$grid), mean, na.rm=TRUE)

# joining aggregated data with 
pop.df.warsaw <- dplyr::left_join(pop.df.warsaw, Price_per_sqm.agg, by = c("ID" = "Group.1")) %>% 
  dplyr::left_join(., Rent.agg, by = c("ID" = "Group.1")) %>% 
  dplyr::left_join(., Building_age.agg, by = c("ID" = "Group.1")) %>% 
  dplyr::left_join(., Distance_to_PKIN_in_km.agg, by = c("ID" = "Group.1")) %>% 
  dplyr::left_join(., Distance_to_subway_in_km.agg, by = c("ID" = "Group.1")) %>% 
  dplyr::left_join(., Distance_to_airport_in_km.agg, by = c("ID" = "Group.1"))
  
colnames(pop.df.warsaw)[5:10] <- c("Price_per_sqm", "Rent_imp", "Building_age", "Distance_to_PKIN_in_km", "Distance_to_subway_in_km", "Distance_to_airport_in_km")

pop.df.warsaw[,5:10][is.na(pop.df.warsaw[,5:10])] <- 0

choropleth(pop.grid.warsaw, pop.df.warsaw$Price_per_sqm)

After creation of final dataset it was possible to move to the modelling part. Contiguity spatial weights matrix and Manski model (containing spatial lag of the dependent variable, spatial lag of the explanatory variables and spatial lag of error) have been chosen. Below full model equation can be seen. The model is estimated with the use of spdep::sacsarlm() function.

\[Y = \rho WY + \alpha_{l_N} + X\beta + WX\theta + u\] \[u = \lambda Wu + \varepsilon \]

# contiguity spatial weights matrix for grid
cont.nb <- poly2nb(pop.grid.warsaw, queen=T)
cont.listw <- nb2listw(cont.nb, style="W") 

# model formula
formula <- formula(Price_per_sqm ~ Rent_imp + Building_age + Distance_to_PKIN_in_km + Distance_to_subway_in_km + Distance_to_airport_in_km) 

# Manski model
gns <- sacsarlm(formula, data=pop.df.warsaw, listw=cont.listw, type="sacmixed")

Since Manski model contains a spatial lag of the dependent variable, due to simultaneity instead of beta coefficients indirect and direct effects should be interpreted. * Direct effect indicate the influence of explanatory variable \(x\) in location \(i\) on the value of dependent variable in location \(i\). * Indirect effect indicate the influence of explanatory variable \(x\) in location \(i\) on the value of dependent variable in location \(j\). * Total effect is a sum of both direct and indirect effects and thus a total effect of explanatory variable, capturing the influence of explanatory variable \(x\) in location \(i\) on the value of dependent variable in locations \(i\) and \(j\).

All three combinations of effects can be seen on the output below: (1) when both effects are positive (Rent_imp, Building_age), (2) when both effects are negative (Distance_to_subway_in_km) and (3) when the direct effects are positive and the indirect effects are negative (Distance_to_PKIN_in_km, Distance_to_airport_in_km).

In case of Rent_imp and Building_age, regardless of location the effect of full support of processes can be seen. The increase of Rent_imp and Building_age in the region i and the increase of Rent_imp and Building_age in the vicinity of j translates into the increase of price per squared metre in the region i. In case of Distance_to_subway_in_km, regardless the location oppositely the effect of full weakening of processes can be seen. The increase of Distance_to_subway_in_km in the region i and an increase in Distance_to_subway_in_km in the vicinity of j translates into a decrease of price per squared metre in the region i. In case of Distance_to_PKIN_in_km and Distance_to_airport_in_km can be seen the leaching relation meaning that resources from one location work in favour of another location, weakening their own location. The increase of both Distance_to_PKIN_in_km and Distance_to_airport_in_km translates into increase of price per squared metre in the region i, while the increase in the location j translates into price per squared metre decrease in the region i.

Table 4. Results of selected spatial model for Hypothesis One

W.c <- as(as_dgRMatrix_listw(cont.listw), "CsparseMatrix") 
trMat <- trW(W.c, type="mult") 
gns_imp <- impacts(gns, tr=trMat, R=2000)
summary(gns_imp, zstats=FALSE, short=TRUE) 
## Impact measures (sacmixed, trace):
##                                  Direct    Indirect       Total
## Rent_imp                   6.300696e-03    1.237816    1.244117
## Building_age               4.746485e+01  225.495241  272.960088
## Distance_to_PKIN_in_km     5.925505e+02 -127.194253  465.356233
## Distance_to_subway_in_km  -3.850896e+02  -95.577282 -480.666880
## Distance_to_airport_in_km  1.489654e+02 -192.936136  -43.970763
## ========================================================
## Simulation results (asymptotic variance matrix):

One can also interpret ratios of effects. For effects in the same direction (Rent_imp, Building_age, Distance_to_subway_in_km), the share of effects will be interpreted. For opposite effects direction (Distance_to_PKIN_in_km, Distance_to_airport_in_km), the relative relationship of both effects will be examined.

For Rent the share of the direct effect is marginal (0.5%) while the share of the indirect effect is 99%. In case of building age, the share of the direct effect is approximately 17% and the share of indirect effect is 82.6%. In case of Distance_to_subway_in_km shares are inverted - the share of the direct effect is 80% while share of indirect effect is above 19%. When analysing ratios of effects for Distance_to_PKIN_in_km and Distance_to_airport_in_km, it can be seen that in case of Distance_to_PKIN_in_km the ratio is above 1 meaning strong internalization while in case of Distance_to_airport_in_km the ratio is below 1 meaning spillover however weaker than the internalization of Distance_to_PKIN_in_km variable.

direct <- gns_imp$res$direct
indirect <- gns_imp$res$indirect
total <- gns_imp$res$total

data.frame("Direct_share" = direct / total,
           "Inirect_share" = indirect / total,
           "Ratio" = abs(direct) / abs(indirect))
##                           Direct_share Inirect_share      Ratio
## Rent_imp                   0.005064391     0.9949356 0.00509017
## Building_age               0.173889333     0.8261107 0.21049157
## Distance_to_PKIN_in_km     1.273326635    -0.2733266 4.65862624
## Distance_to_subway_in_km   0.801156923     0.1988431 4.02909135
## Distance_to_airport_in_km -3.387827769     4.3878278 0.77209680

After obtaining and interpreting direct and indirect impacts, the assessment of spatial autocorrelation of residuals using I Moran statistics has been made. According to the results of the moran.test(), we cannot reject the null hypothesis on lack of spatial autocorrelation (random distribution of values) which means that our spatial model has successfully filtered the spatial relationship.

moran.test(gns$residuals, cont.listw)
## 
##  Moran I test under randomisation
## 
## data:  gns$residuals  
## weights: cont.listw    
## 
## Moran I statistic standard deviate = -0.43381, p-value = 0.6678
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.010911769      -0.001666667       0.000454170

3.1.3. Non-parametric models

The last type of model run is non-parametric model Gradient Boosting which is an ensemble algorithm based on boosting technique.

set.seed(1234)

vars <- c("Price_per_sqm", "Distance_to_PKIN_in_km", "Distance_to_subway_in_km", "Distance_to_airport_in_km")

df_final <- as.data.frame(df_final) %>% dplyr::select(-grid)

property.gbm <- gbm(Price_per_sqm~.,
                   data=df_final[,vars],
                   distribution="gaussian",
                   n.trees=1000,
                   shrinkage = 0.01,
                   interaction.depth=4)

On the output below can be seen relative importance of the three variables used in the modelling namely Distance_to_PKIN_in_km, Distance_to_subway_in_km and Distance_to_airport_in_km. It can be seen that variable the most influencing price of the property per squared metre is Distance_to_PKIN_in_km. All three predictors had however non-zero influence.

summary_gbm <- summary(property.gbm, plotit = FALSE)  
summary_gbm %>% 
  kable() %>% 
  kable_styling()
var rel.inf
Distance_to_PKIN_in_km Distance_to_PKIN_in_km 77.68177
Distance_to_airport_in_km Distance_to_airport_in_km 12.28839
Distance_to_subway_in_km Distance_to_subway_in_km 10.02984

3.2. Hypothesis Two

Hypothesis Two
The coronavirus pandemic had a negative impact on the housing-prices


To verify this hypothesis a simple approach has been introduced to have prices of apartments as the dependent variable and set binary dependent variable, indicating whether the offer was added before or after COVID pandemic in Poland, approximately, began.

Table 6. Results of OLS regression for Hypothesis Two

df_hyp2 <- df_final %>% dplyr::select(Total_price, Days_from_initial_announcement)

df_hyp2 <- df_hyp2 %>% mutate(Months_passed = round(Days_from_initial_announcement/30, 1))

df_hyp2 <- df_hyp2 %>% mutate(covid_time = ifelse(Months_passed <= 3, 1,0))

model_hyp2 <- lm(formula = Total_price ~ covid_time, data = df_hyp2)

model_hyp2 %>%
  tab_model() 
  Total_price
Predictors Estimates CI p
(Intercept) 913613.96 875413.30 – 951814.62 <0.001
covid_time -192085.08 -232361.59 – -151808.58 <0.001
Observations 9679
R2 / R2 adjusted 0.009 / 0.009


The reference level of the regression is 0, which means that a specific price was recorded before COVID pandemic. T-test indicates that the variable Covid time is significant. Its coefficient has a negative sign. This might point out that the prices of apartments after COVID pandemic begun are, indeed, lower than before it. Therefore, we cannot reject the null hypothesis of Hypothesis Two . However, the simple form of the approach proposed in this subsection does not come without limitations. The table below shows basic grouped statistics for observations labeled as before/after COVID pandemic began.

Table 7. Average price and count of observations before and after the start of COVID pandemic

df_hyp2 %>% 
  group_by(covid_time) %>% 
  summarise(Average_price = mean(Total_price), Count = n()) %>% 
  kable() %>% 
  kable_styling()
covid_time Average_price Count
0 913614.0 972
1 721528.9 8707


It can be easily spotted that average price of the offers posted before the pandemic is visibly higher. In the same time their count is much lower. The difference in average price could be explained with the fact of COVID pandemic. However, the clear difference in counts of both levels of variable might indicate that some of the offers posted after COVID pandemic began are not the first offers. In the other words, it is possible that the prices of some appartments are lower, because they had been first posted in otodom.pl before the pandemic, but had not been sold. Therefore, sellers lowered prices and posted new offers, after the start of pandemic. To improve the verification of Hypothesis Two , potential future researcher should control for a following feature:

  • Is an offer for specific apartment the first offer posted in Otodom?

Unfortunately, the data which was obtained for this study do not make it possible to include such a feature.

3.3. Auxiliary Hypotheses

For the verification of auxliary hypotheses the same OLS regression approach as in the case of Hypothesis Two is introduced

3.3.1. Auxiliary Hypothesis One

Auxiliary Hypothesis One
The prices of the apartments located on the east side of Vistula river are, in general, cheaper


Table 8. Results of OLS regression for Axuliary Hypothesis One

df_hyp3 <- df_final %>% dplyr::select(Total_price, Vistula_side)
model_hyp3 <- lm(formula = Total_price ~ Vistula_side, data = df_hyp3)
model_hyp3 %>%
  tab_model() 
  Total_price
Predictors Estimates CI p
(Intercept) 525623.75 502647.01 – 548600.49 <0.001
Vistula_side [1] 293734.65 266890.48 – 320578.82 <0.001
Observations 9679
R2 / R2 adjusted 0.045 / 0.045


The results of the regression show that the side of Vistula river the apartment is located in, has a significant impact on price. Also, the coefficent of the variable indicates that the apartments on the west side of river are more expensive. Therefore, Auxiliary Hypothesis One cannot be rejected.

3.3.2 Auxiliary Hypothesis Two

Auxiliary Hypothesis Two
There is a siginficant relation between prices of apartments and life quality measures in specific districts Warsaw


Table 8. Results of OLS regression for Auxiliary Hypothesis Two

df_hyp4 <- df_final %>% 
              group_by(District) %>% 
              summarise(Average_price = mean(Total_price), 
                        Average_life_quality = mean(Life_quality) 
                        ) %>% dplyr::select(Average_price, Average_life_quality)  

model_hyp4 <- lm(formula = Average_price ~ Average_life_quality, data = df_hyp4)

model_hyp4 %>%
  tab_model() 
  Average_price
Predictors Estimates CI p
(Intercept) 197072.97 10168.21 – 383977.72 0.040
Average_life_quality 1120624.92 715013.13 – 1526236.72 <0.001
Observations 18
R2 / R2 adjusted 0.682 / 0.662


In this case the regression was conducted on aggregated data. All the observations were grouped on district level to create a dependent variable – Average price of apartment in specific district, and independent one – Average life quality in specific district. The latter proved to be significant. Then, Auxiliary Hypothesis Two cannot be rejected.

Conclusion

The main idea of this work was to study the determinants of apartment prices in Warsaw. Variety of different aspects of this issue have been studied. The hypotheses and the conclusions, coming from their verification, are presented in the list below:

Bibliography


  1. Nominatim. (2020). https://nominatim.org/release-docs/latest/↩︎

  2. Project-OSRM. (2020). http://project-osrm.org/docs/v5.5.1/api/#general-options↩︎

  3. D’Acci, L. (2018). Quality of urban area, distance from city centre, and housing value. Case study on real estate values in Turin. Cities. Retrieved from: https://www.sciencedirect.com/science/article/abs/pii/S0264275118308552?via%3Dihub↩︎

  4. “Ranking dzielnic Warszawy pod względem atrakcyjności warunków życia”, GUS Warszawa, source: https://warszawa.stat.gov.pl/publikacje-i-foldery/warunki-zycia/ranking-dzielnic-warszawy-pod-wzgledem-atrakcyjnosci-warunkow-zycia,1,2.html↩︎