The real estate market plays a crucial role in economic growth, urban development, and investment strategies. Understanding its structure, pricing trends, and key influencing factors is essential for stakeholders, including buyers, investors, and policymakers. This project provides an in-depth analysis of the real estate market using statistical methods and data visualization techniques to extract meaningful insights from the available data.
The following milestones were outlined during the design process:
Exploratory Analysis: Understanding the format, types, and sources of data.
Data Cleansing and Wrangling: Ensuring data accuracy, handling missing values, and standardizing formats.
Outliers: Identifying and analyzing unusual price points that may affect overall trends.
Visualizations: Graphical representations to facilitate pattern recognition and trend analysis.
Descriptive Analysis: Exploring key metrics such as mean, median, variance, and skewness to obtain aggregated statistical insights that describe the market dynamics.
Price Distribution Shape: Analyzing the distribution of prices through density plots and statistical summaries.
Statistical Tests: Assessing the statistical significance of median price differences across cities and investigating the impact of amenities on price variations.
Provide a clear understanding of real estate price structures.
Identify significant price differentiators and market trends.
Utilize statistical techniques to validate insights.
Present findings through comprehensive visual and tabular representations.
The dataset contains apartment sales and rental offers from the 15 largest cities in Poland (Warsaw, Lodz, Krakow, Wroclaw, Poznan, Gdansk, Szczecin, Bydgoszcz, Lublin, Katowice, Bialystok, Czestochowa, Gdynia, Radom and Rzeszow). The data comes from local websites with apartments for sale. To capture the neighborhood of each apartment, each listing has been augmented with OpenStreetMap data giving distances to points of interest (POIs). The data is collected monthly and covers the period from August 2023 to June 2024.
apartments_pl_YYYY_MM.csv - monthly snapshot of sales listings
apartments_rent_pl_YYYY_MM.csv - monthly snapshot of rental offers
Two important data manipulations are carried out in this step: adding the month each record comes from, and binding the monthly files together. In the raw data, the date appears only in the file name, and each month is stored in a separate file.
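The step described above can be sketched as follows. The report itself was produced in R, so this Python version is only an illustration: the file-name pattern is taken from the two templates listed above, while the function names, the in-memory input and the sample rows are assumptions made for the example.

```python
import csv
import io
import re

# Pattern assumed from the two file-name templates listed above.
FILENAME_RE = re.compile(r"^apartments_(rent_)?pl_(\d{4})_(\d{2})\.csv$")


def parse_snapshot_name(filename):
    """Return (dataset, 'YYYY-MM') for a monthly snapshot file name."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    dataset = "rent" if match.group(1) else "buy"
    return dataset, f"{match.group(2)}-{match.group(3)}"


def bind_snapshots(files):
    """Stack rows from several snapshots and add a 'month' column to each row.

    `files` maps a file name to its CSV text (hypothetical in-memory input).
    """
    combined = {"buy": [], "rent": []}
    for filename, text in files.items():
        dataset, month = parse_snapshot_name(filename)
        for row in csv.DictReader(io.StringIO(text)):
            row["month"] = month
            combined[dataset].append(row)
    return combined


# Tiny made-up snapshots, for illustration only.
example_files = {
    "apartments_pl_2023_08.csv": "id,city,price\na1,warszawa,800000\n",
    "apartments_pl_2023_09.csv": "id,city,price\na1,warszawa,790000\n",
    "apartments_rent_pl_2023_08.csv": "id,city,price\nr1,krakow,3000\n",
}
bound = bind_snapshots(example_files)
print(len(bound["buy"]), len(bound["rent"]))
print(bound["buy"][1]["month"])
```

Note that the same listing (`a1` here) appears once per month in which it was offered, which is why duplicates across snapshots are expected.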
Data cleansing and wrangling are essential steps in the data analysis process, ensuring that raw data is transformed into a structured, accurate, and usable format. Data cleansing involves, among other things, handling missing values and outliers and identifying and correcting errors. Data wrangling focuses on reshaping, merging, and transforming datasets to make them suitable for analysis.
The goal of these processes is to enhance data integrity, eliminate inconsistencies, and prepare the dataset for meaningful insights.
Row IDs are unique and therefore provide no differentiation or analytical value. Additionally, due to the nature of the data, duplicate values are possible. For example, if a property was listed for sale in January but sold in March, it would appear three times—once for each month it was listed.
A brief exploration of the dataset reveals that both the rent and buy datasets share the same 28 features, which include both categorical and continuous variables. The buy dataset contains 92 967 unique records, while the rent dataset has 37 941 unique records. An interesting observation is that the data includes not only missing values (NAs) but also different types of blank entries, which will be addressed in the next section.

Missing values in the ‘buy’ data set:

| variable | n_miss | pct_miss |
|---|---|---|
| condition | 69914 | 75.2 |
| buildingMaterial | 39179 | 42.1 |
| type | 19792 | 21.3 |
| floor | 15984 | 17.2 |
| buildYear | 15641 | 16.8 |
| hasElevator | 4448 | 4.78 |
| collegeDistance | 2492 | 2.68 |
| floorCount | 1082 | 1.16 |
| clinicDistance | 333 | 0.358 |
| restaurantDistance | 226 | 0.243 |
| pharmacyDistance | 128 | 0.138 |
| postOfficeDistance | 99 | 0.106 |
| kindergartenDistance | 85 | 0.0914 |
| schoolDistance | 60 | 0.0645 |
| city | 0 | 0 |
| squareMeters | 0 | 0 |
| rooms | 0 | 0 |
| latitude | 0 | 0 |
| longitude | 0 | 0 |
| centreDistance | 0 | 0 |
| poiCount | 0 | 0 |
| ownership | 0 | 0 |
| hasParkingSpace | 0 | 0 |
| hasBalcony | 0 | 0 |
| hasSecurity | 0 | 0 |
| hasStorageRoom | 0 | 0 |
| price | 0 | 0 |
| month | 0 | 0 |

Missing values in the ‘rent’ data set:

| variable | n_miss | pct_miss |
|---|---|---|
| condition | 27790 | 73.2 |
| buildingMaterial | 16157 | 42.6 |
| buildYear | 10570 | 27.9 |
| type | 9190 | 24.2 |
| floor | 4646 | 12.2 |
| hasElevator | 2096 | 5.52 |
| floorCount | 730 | 1.92 |
| collegeDistance | 490 | 1.29 |
| restaurantDistance | 82 | 0.216 |
| pharmacyDistance | 38 | 0.100 |
| clinicDistance | 32 | 0.0843 |
| kindergartenDistance | 28 | 0.0738 |
| postOfficeDistance | 19 | 0.0501 |
| schoolDistance | 11 | 0.0290 |
| city | 0 | 0 |
| squareMeters | 0 | 0 |
| rooms | 0 | 0 |
| latitude | 0 | 0 |
| longitude | 0 | 0 |
| centreDistance | 0 | 0 |
| poiCount | 0 | 0 |
| ownership | 0 | 0 |
| hasParkingSpace | 0 | 0 |
| hasBalcony | 0 | 0 |
| hasSecurity | 0 | 0 |
| hasStorageRoom | 0 | 0 |
| price | 0 | 0 |
| month | 0 | 0 |
Both data sets have some serious problems with missing data, especially in condition and building material.
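A missing-value summary like the two tables above can be sketched as below. This is a Python illustration rather than the R code behind the report; the set of blank markers is an assumption, since the text only says that several kinds of blank entries occur, and the sample rows are made up.

```python
# Assumed blank-like markers; the report does not list the actual ones.
BLANK_MARKERS = {"", "na", "n/a", "nan", "null"}


def is_missing(value):
    """Treat None and blank-like strings as missing."""
    return value is None or str(value).strip().lower() in BLANK_MARKERS


def missing_summary(rows):
    """Return [(variable, n_miss, pct_miss)] sorted by n_miss, descending."""
    n_rows = len(rows)
    summary = []
    for column in rows[0].keys():
        n_miss = sum(is_missing(row.get(column)) for row in rows)
        summary.append((column, n_miss, round(100 * n_miss / n_rows, 2)))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Made-up rows, for illustration only.
rows = [
    {"city": "gdansk", "condition": "NA", "floor": "3"},
    {"city": "gdynia", "condition": "", "floor": " "},
    {"city": "radom", "condition": "premium", "floor": "1"},
    {"city": "lublin", "condition": None, "floor": "2"},
]
for variable, n_miss, pct_miss in missing_summary(rows):
    print(variable, n_miss, pct_miss)
```

Normalizing every blank variant to a single missing marker first keeps the counts comparable across columns.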
Data set ‘buy’
An analysis of missing values in the ‘buy’ dataset using an upset plot reveals some connections between condition, building material, and type. However, given the scale of missing values in these features, their mutual occurrence does not appear to be significant. On the other hand, in Gdańsk and Gdynia, the ‘condition’ attribute is missing more frequently than in other cities, while in Częstochowa, the ‘build year’ is notably absent more often.
Data set ‘rent’
A similar pattern is observed in the ‘rent’ dataset. Once again, some connections between condition, building material, and type are visible, but they do not appear to be significant. In this case, missing condition data is noticeable not only in Gdańsk and Gdynia but also in Częstochowa, where the ‘build year’ attribute is also frequently missing.
In the next step, we check the correlation between missing values and the other variables, as follows:
code each value as 1 if it is NA and 0 otherwise
encode the binary categorical variables
exclude the non-binary variables
The correlation matrix does not show any strong dependencies between missingness and the other variables.
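The steps above can be illustrated with a small Python sketch (the report itself was produced in R). It codes missingness as a 0/1 indicator and correlates it with a binary amenity flag; the Pearson coefficient is used as the correlation measure, which is an assumption, and the sample values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


def missing_indicator(values):
    """Code NA as 1 and any observed value as 0."""
    return [1 if value is None else 0 for value in values]


# Made-up columns, for illustration only.
condition = [None, None, "premium", None, "low", None]
has_balcony = ["yes", "no", "yes", "yes", "no", "no"]

condition_missing = missing_indicator(condition)
balcony = [1 if value == "yes" else 0 for value in has_balcony]
r = pearson(condition_missing, balcony)
print(round(r, 3))
```

A coefficient near zero, as in this toy case, is what "no strong dependency" looks like for a pair of indicator variables.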
To find outliers, we used several methods:
Grubbs’ test
boxplots
interquartile range (IQR) method and Q-Q plots
We paid special attention to price since this is the subject of our research. Grubbs’ test allows us to identify the features with potential outliers.

Grubbs’ test p-values, ‘buy’ data set:

| Variable | p-value |
|---|---|
| floor | 0 |
| floorCount | 1.00849896211841e-07 |
| poiCount | 0 |
| schoolDistance | 0 |
| postOfficeDistance | 0 |
| kindergartenDistance | 0 |
| restaurantDistance | 0 |
| pharmacyDistance | 0 |
| price | 2.03623467800451e-05 |

Grubbs’ test p-values, ‘rent’ data set:

| Variable | p-value |
|---|---|
| floor | 0 |
| floorCount | 2.41967365033346e-08 |
| buildYear | 0.0077240047029381 |
| centreDistance | 0.0210443044676376 |
| poiCount | 2.45829663292341e-08 |
| schoolDistance | 0 |
| clinicDistance | 0.000174023317614402 |
| postOfficeDistance | 0 |
| kindergartenDistance | 0 |
| restaurantDistance | 0 |
| pharmacyDistance | 0 |
| price | 0 |
The strongest outliers are suspected in floor, price, and proximity-related variables (schoolDistance, postOfficeDistance, etc.), suggesting unusually high or low values in these attributes. Outliers in rental prices could indicate luxury apartments or extremely cheap listings. Floor-related outliers might suggest penthouse apartments or ground-floor units with unusual characteristics. The rental data set has more variables with outlier candidates.
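For reference, the statistic behind Grubbs’ test is the largest absolute deviation from the mean divided by the sample standard deviation. The Python sketch below computes only that statistic; turning it into a p-value requires a Student's t quantile, which is left out here. The rent values are made up and the function name is hypothetical.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, suspect), where G = max |x - mean| / sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    suspect = max(values, key=lambda value: abs(value - center))
    return abs(suspect - center) / spread, suspect


# Made-up monthly rents in PLN, for illustration only.
rents = [2500, 2700, 2600, 2800, 2650, 23000]
g, suspect = grubbs_statistic(rents)
print(suspect, round(g, 2))
```

Grubbs’ test assumes approximately normal data, so on fat-tailed variables it tends to flag a candidate in almost every feature, which matches the long lists in the tables above.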
Outliers visualization for the ‘buy’ data set
Outliers visualization for the ‘rent’ data set
Outliers in the distance variables indicate that some apartments are located in remote areas with poor access to public facilities. Outliers in the floor and floorCount variables indicate penthouses or very high-floor apartments in skyscrapers. High poiCount values suggest that some apartments are located in highly urbanized areas with dense amenities, while high prices indicate luxury or premium properties. The boxplots exhibit a dense concentration of outliers, with no observations noticeably separated from the rest. This suggests that the detected outliers are not due to erroneous data but rather result from fat-tailed distributions of these features.
## [1] "Price outliers related to the original dataset - buy 6.09%"
## [1] "Price outliers related to the original dataset - rent 7.57%"
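A share of price outliers like the one reported above can be computed with the interquartile range rule. The Python sketch below assumes the usual 1.5 × IQR fences; the multiplier used in the report is not stated, and the prices are made up.

```python
from statistics import quantiles


def iqr_outlier_share(values, k=1.5):
    """Return (percentage flagged, flagged values) using Q1 - k*IQR and Q3 + k*IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [value for value in values if value < lower or value > upper]
    return 100 * len(flagged) / len(values), flagged


# Made-up monthly rents in PLN, for illustration only.
prices = [2000, 2200, 2400, 2500, 2600, 2800, 3000, 3200, 3500, 23000]
share, flagged = iqr_outlier_share(prices)
print(f"Price outliers: {share:.2f}%")
```

Different quantile conventions shift the fences slightly, so the exact percentage can differ between tools on the same data.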
Lowest-priced rental listing:

| | city | price | squareMeters | poiCount |
|---|---|---|---|---|
| 30685 | katowice | 346 | 30.36 | 4 |

Highest-priced rental listings:

| | city | price | squareMeters | poiCount |
|---|---|---|---|---|
| 32075 | warszawa | 23000 | 148.0 | 62 |
| 33121 | warszawa | 23000 | 147.6 | 60 |
| 33521 | warszawa | 23000 | 126.0 | 100 |
While 346 PLN for rent may seem low, there are multiple observations close to this price. There are 3 observations with price of 23 000 PLN, all located in the center of Warsaw, where there is a high density of points of interest (POI).
Lowest-priced listings for sale:

| | city | price | squareMeters | poiCount |
|---|---|---|---|---|
| 18202 | bydgoszcz | 150000 | 31.02 | 8 |
| 18206 | bydgoszcz | 150000 | 35.94 | 8 |

Highest-priced listings for sale:

| | city | price | squareMeters | poiCount |
|---|---|---|---|---|
| 79943 | warszawa | 3250000 | 133.51 | 9 |
| 80351 | warszawa | 3250000 | 136.00 | 2 |
| 84104 | warszawa | 3250000 | 133.51 | 13 |
| 129837 | warszawa | 3250000 | 131.70 | 26 |
For apartments listed for sale, neither a price of 150,000 PLN nor 3,250,000 PLN appears alarming or unusual. There are multiple listings with prices close to these values, suggesting they are within a reasonable range.
An additional confirmation of the price distribution would be a Q-Q plot. Since apartment prices are typically log-normally distributed, a Q-Q plot of log-transformed rent prices would help assess their conformity to this distribution.
In our case distributions are fat-tailed.
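The points of such a Q-Q plot can be computed as in the Python sketch below: sorted log-prices are paired with quantiles of a normal distribution fitted to them. The plotting positions (i - 0.5) / n are one common convention and an assumption here, and the prices are made up.

```python
from math import log
from statistics import NormalDist, mean, stdev


def log_qq_points(prices):
    """Return (theoretical, observed) quantile pairs for log-transformed prices."""
    logs = sorted(log(price) for price in prices)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, logs))


# Made-up monthly rents in PLN, for illustration only.
prices = [1800, 2200, 2500, 2700, 3000, 3400, 4200, 6500, 23000]
points = log_qq_points(prices)
for theoretical, observed in points:
    print(round(theoretical, 2), round(observed, 2))
```

If log-prices were exactly normal the pairs would lie on the diagonal; observed values that exceed the theoretical ones at the upper end are the signature of a fat right tail.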
Data validation ensures data correctness by verifying logical consistency and identifying anomalies. In this data set, we checked the following properties:
Floor number is less than or equal to the total floor count.
Build year is less than 2024.
Latitude and longitude fall within Poland’s geographic range.
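The three checks can be written as simple row-level rules, as in the Python sketch below. The bounding box is only an approximate rectangle around Poland chosen for the example, and the function name and sample rows are hypothetical.

```python
# Approximate bounding box of Poland (an assumption for illustration).
LAT_RANGE = (49.0, 55.0)
LON_RANGE = (14.0, 24.5)


def validation_problems(row):
    """Return a list of rule violations for one listing (empty if the row is valid)."""
    problems = []
    floor, floor_count = row.get("floor"), row.get("floorCount")
    if floor is not None and floor_count is not None and floor > floor_count:
        problems.append("floor > floorCount")
    build_year = row.get("buildYear")
    if build_year is not None and build_year >= 2024:
        problems.append("buildYear >= 2024")
    if not LAT_RANGE[0] <= row["latitude"] <= LAT_RANGE[1]:
        problems.append("latitude outside Poland")
    if not LON_RANGE[0] <= row["longitude"] <= LON_RANGE[1]:
        problems.append("longitude outside Poland")
    return problems


# Made-up rows, for illustration only.
ok_row = {"floor": 3, "floorCount": 10, "buildYear": 1998,
          "latitude": 52.23, "longitude": 21.01}
bad_row = {"floor": 12, "floorCount": 10, "buildYear": 2030,
           "latitude": 48.0, "longitude": 21.01}
print(validation_problems(ok_row))
print(validation_problems(bad_row))
```

Missing floor or build-year values are skipped rather than flagged, since they are handled separately by imputation.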
Since there are no incorrect data or outliers, we can proceed with imputing missing values using the MICE method, specifically the classification and regression trees (CART) variation. First, we encode text and categorical variables to make them suitable for imputation. Next, we perform the imputation, verify its success, and finally decode the variables back to their original format.
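The encode and decode steps around the imputation can be sketched as below. This Python example does not implement MICE or CART: the imputation is replaced by a placeholder that fills gaps with a fixed code, purely to show that the round trip restores the original labels. The category labels and function names are assumptions.

```python
def fit_codes(values):
    """Map each observed category to an integer code."""
    levels = sorted({value for value in values if value is not None})
    return {level: code for code, level in enumerate(levels)}


def encode(values, codes):
    """Replace categories with codes, keeping None for missing entries."""
    return [codes[value] if value is not None else None for value in values]


def decode(encoded, codes):
    """Map integer codes back to the original category labels."""
    levels = {code: level for level, code in codes.items()}
    return [levels[code] if code is not None else None for code in encoded]


# Made-up column, for illustration only.
building_material = ["brick", None, "concreteSlab", "brick"]
codes = fit_codes(building_material)
encoded = encode(building_material, codes)

# Placeholder for the imputation model: NOT MICE/CART, just a fixed fill value.
imputed = [code if code is not None else codes["brick"] for code in encoded]

print(decode(imputed, codes))
```

Keeping the code mapping alongside the imputed data is what makes the final decoding step possible.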
To avoid retraining the model each time, we will save the imputation results and disable the relevant cells from running in every Markdown execution.
Data was successfully imputed
As the graph shows, purchase prices are highest in the largest Polish cities; cities whose median price exceeds the overall median are marked in red. The highest prices are in Warsaw, Krakow and Gdansk, while the cheapest housing is in Czestochowa, Radom and Bydgoszcz. Median rents are closer to one another across cities, with one outlier: the capital, Warsaw, where the median is over 30% higher than in other cities.
Purchase data set
Buy data set
Buy data set
Rental data set
The presence of amenities is associated with higher housing prices, with one exception, the storage room: the median price of housing with a storage room is lower than without it. The price difference between the categories is plotted in the next graph.
Amenities (all except a storage room) seem to have a noticeable positive impact on the price of an apartment. The lower purchase price of housing with a storage room may be caused by hidden additional costs, that is, buyers may have to pay extra for the storage room. This phenomenon requires further analysis with a broader range of data.
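The comparison behind these statements is a median price per amenity level. A minimal Python sketch, with made-up prices and a hypothetical helper name:

```python
from statistics import median


def median_by_flag(rows, flag, value="price"):
    """Return the median of `value` for each level of the amenity column `flag`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[flag], []).append(row[value])
    return {level: median(values) for level, values in groups.items()}


# Made-up listings, for illustration only.
listings = [
    {"hasElevator": "yes", "price": 820000},
    {"hasElevator": "yes", "price": 760000},
    {"hasElevator": "yes", "price": 905000},
    {"hasElevator": "no", "price": 610000},
    {"hasElevator": "no", "price": 655000},
]
medians = median_by_flag(listings, "hasElevator")
print(medians["yes"] - medians["no"])
```

A raw median gap does not control for city or size, which is one reason an amenity such as a storage room can appear to lower the price.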
Purchase data set
No strong correlation was found between the price level and the distance to public places. However, the distance variables are fairly strongly positively correlated with one another.
Rental data set
Similar trends are observed in the rental market data set.
Interesting things can be observed: the model is less certain about the smoothed price for rental apartments built in the 1970s, which might suggest that fewer of them are available for rent. This phenomenon is consistent with current market conditions, as demand for new apartments is higher. Apartments in very old buildings tend to be very expensive, but there are not many of them. Communist-era blocks of flats are the least expensive for both rent and purchase. Among modern buildings, those built around 2508 seem to be the most expensive. Prices for the newest buildings seem to fall with each year.
Housing condition has no significant influence on the price level; nevertheless, it is visible that premium housing is slightly more expensive. The figure for premium condition has a more curved shape, which can be explained by the larger number of observations.
Data summary, purchase data set:

| No | Variable | Type | Stats / Values |
|---|---|---|---|
| 1 | city | character | |
| 2 | type | character | |
| 3 | squareMeters | numeric | 7173 distinct values |
| 4 | rooms | integer | |
| 5 | floor | integer | 27 distinct values |
| 6 | floorCount | integer | 29 distinct values |
| 7 | buildYear | integer | 165 distinct values |
| 8 | latitude | numeric | 46449 distinct values |
| 9 | longitude | numeric | 48233 distinct values |
| 10 | centreDistance | numeric | 1487 distinct values |
| 11 | poiCount | integer | 196 distinct values |
| 12 | schoolDistance | numeric | 2505 distinct values |
| 13 | clinicDistance | numeric | 4263 distinct values |
| 14 | postOfficeDistance | numeric | 2689 distinct values |
| 15 | kindergartenDistance | numeric | 2309 distinct values |
| 16 | restaurantDistance | numeric | 2406 distinct values |
| 17 | collegeDistance | numeric | 4817 distinct values |
| 18 | pharmacyDistance | numeric | 2420 distinct values |
| 19 | ownership | character | |
| 20 | buildingMaterial | character | |
| 21 | condition | character | |
| 22 | hasParkingSpace | character | |
| 23 | hasBalcony | character | |
| 24 | hasElevator | character | |
| 25 | hasSecurity | character | |
| 26 | hasStorageRoom | character | |
| 27 | price | integer | 7287 distinct values |
| 28 | month | character | |
| 29 | price_per_m2 | numeric | 47186 distinct values |

Generated by summarytools 1.0.1 (R version 4.4.1), 2025-02-02
Summary table contains information about every column in the purchase data set. It can be noticed from the table that
Data summary, rental data set:

| No | Variable | Type | Stats / Values |
|---|---|---|---|
| 1 | city | character | |
| 2 | type | character | |
| 3 | squareMeters | numeric | 3260 distinct values |
| 4 | rooms | integer | |
| 5 | floor | integer | 27 distinct values |
| 6 | floorCount | integer | 30 distinct values |
| 7 | buildYear | integer | 148 distinct values |
| 8 | latitude | numeric | 22850 distinct values |
| 9 | longitude | numeric | 23409 distinct values |
| 10 | centreDistance | numeric | 1302 distinct values |
| 11 | poiCount | integer | 188 distinct values |
| 12 | schoolDistance | numeric | 1667 distinct values |
| 13 | clinicDistance | numeric | 3162 distinct values |
| 14 | postOfficeDistance | numeric | 1878 distinct values |
| 15 | kindergartenDistance | numeric | 1464 distinct values |
| 16 | restaurantDistance | numeric | 1455 distinct values |
| 17 | collegeDistance | numeric | 4153 distinct values |
| 18 | pharmacyDistance | numeric | 1489 distinct values |
| 19 | ownership | character | 1. condominium |
| 20 | buildingMaterial | character | |
| 21 | condition | character | |
| 22 | hasParkingSpace | character | |
| 23 | hasBalcony | character | |
| 24 | hasElevator | character | |
| 25 | hasSecurity | character | |
| 26 | hasStorageRoom | character | |
| 27 | price | integer | 996 distinct values |
| 28 | month | character | |
| 29 | price_per_m2 | numeric | 9600 distinct values |
Rental market data in general follow the same trend, however there are some differences:
## [1] "TAI (tabular accuracy index)"
## # classes Goodness of fit Tabular accuracy
## 41.0000000 0.9947621 0.9148005
The price distribution is skewed. A histogram was built to show how many listings fall into each price class, from the minimum price of 150,000 PLN up to the maximum of 3,250,000 PLN in steps of 100,000 PLN. The histogram shows that the most represented class is 600,000-700,000 PLN. The partition was checked with the tabular accuracy index (TAI), which shows high tabular accuracy and goodness of fit.
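The Python sketch below shows how such a partition and its two diagnostics can be computed. It uses the definitions that are commonly given for these indices: goodness of fit as one minus the ratio of within-class to total squared deviations, and tabular accuracy as the same ratio built from absolute deviations. Whether the report's R output follows exactly these formulas is an assumption, and the prices are made up.

```python
from statistics import mean


def fixed_width_classes(values, start, width):
    """Group values into consecutive classes of equal width, skipping empty ones."""
    classes = {}
    for value in values:
        index = int((value - start) // width)
        classes.setdefault(index, []).append(value)
    return [classes[index] for index in sorted(classes)]


def accuracy_indices(values, classes):
    """Return goodness of variance fit (gvf) and tabular accuracy index (tai)."""
    overall = mean(values)
    total_squared = sum((value - overall) ** 2 for value in values)
    total_absolute = sum(abs(value - overall) for value in values)
    within_squared = 0.0
    within_absolute = 0.0
    for members in classes:
        class_mean = mean(members)
        within_squared += sum((value - class_mean) ** 2 for value in members)
        within_absolute += sum(abs(value - class_mean) for value in members)
    return {"gvf": 1 - within_squared / total_squared,
            "tai": 1 - within_absolute / total_absolute}


# Made-up purchase prices in PLN, for illustration only.
prices = [150000, 180000, 420000, 460000, 650000, 690000, 3250000]
classes = fixed_width_classes(prices, start=150000, width=100000)
indices = accuracy_indices(prices, classes)
print(len(classes), round(indices["gvf"], 4), round(indices["tai"], 4))
```

Both indices approach 1 as the classes become narrower, so high values mainly confirm that the chosen class width loses little information.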
## [1] "TAI (tabular accuracy index)"
## # classes Goodness of fit Tabular accuracy
## 46.0000000 0.9959035 0.9154236
The rental price distribution is narrower than the purchase one, which can be explained by the target group of these offers (often this group cannot afford to buy housing).
This image shows density plots of the price per square meter in both the purchase and rental markets. The distributions of price per square meter are less skewed than those of total prices. In addition:
Purchase Market
The distribution appears right-skewed (positively skewed), meaning most properties are priced below the mean, with some high-priced outliers.
The median price per square meter is 13,322.98 PLN.
There’s a peak around 10,000-15,000 PLN, suggesting a common price range.
Rental Market
Also right-skewed, but with a sharper peak and a steeper decline.
The median rental price per square meter is 65.67 PLN.
Most rentals are concentrated in the 40-80 PLN per square meter range.
| city | apartmentBuilding | blockOfFlats | tenement | Sum |
|---|---|---|---|---|
| bialystok | 11.50 | 85.25 | 3.24 | 99.99 |
| bydgoszcz | 8.77 | 65.95 | 25.28 | 100.00 |
| czestochowa | 7.33 | 85.35 | 7.33 | 100.01 |
| gdansk | 19.30 | 64.85 | 15.85 | 100.00 |
| gdynia | 20.62 | 67.39 | 11.99 | 100.00 |
| katowice | 19.40 | 56.86 | 23.74 | 100.00 |
| krakow | 20.05 | 65.68 | 14.27 | 100.00 |
| lodz | 15.07 | 69.70 | 15.23 | 100.00 |
| lublin | 3.48 | 87.12 | 9.40 | 100.00 |
| poznan | 14.87 | 64.91 | 20.23 | 100.01 |
| radom | 3.08 | 89.23 | 7.69 | 100.00 |
| rzeszow | 12.73 | 82.94 | 4.33 | 100.00 |
| szczecin | 8.04 | 62.08 | 29.88 | 100.00 |
| warszawa | 26.58 | 60.10 | 13.32 | 100.00 |
| wroclaw | 24.76 | 55.13 | 20.12 | 100.01 |
The table shows the percentage of housing of each type per city in the purchase data set. Some cities have a higher percentage of apartment buildings (Gdansk, Gdynia, Krakow, Warszawa and Wroclaw); they are characterized by a greater market share and higher prices. In smaller cities with lower prices, the relative number of offers in blocks of flats is higher (Bialystok, Czestochowa, Lublin, Radom). Nonetheless, there is a common trend across all cities: the majority of housing is offered in blocks of flats, while apartment buildings and tenements make up a smaller group.
| city | apartmentBuilding | blockOfFlats | tenement | Sum |
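Row percentages of this kind can be computed as in the Python sketch below; the listings are made up and the helper name is hypothetical.

```python
from collections import Counter


def row_percentages(rows, row_key, col_key):
    """Return {row level: {column level: percentage of that row}}."""
    counts = Counter((row[row_key], row[col_key]) for row in rows)
    totals = Counter(row[row_key] for row in rows)
    table = {}
    for (row_level, col_level), n in counts.items():
        share = round(100 * n / totals[row_level], 2)
        table.setdefault(row_level, {})[col_level] = share
    return table


# Made-up listings, for illustration only.
listings = [
    {"city": "radom", "type": "blockOfFlats"},
    {"city": "radom", "type": "blockOfFlats"},
    {"city": "radom", "type": "blockOfFlats"},
    {"city": "radom", "type": "tenement"},
    {"city": "warszawa", "type": "apartmentBuilding"},
    {"city": "warszawa", "type": "blockOfFlats"},
]
shares = row_percentages(listings, "city", "type")
print(shares["radom"])
print(shares["warszawa"])
```

Because each cell is rounded separately, a row of real data can sum to 99.99 or 100.01, as in the Sum column above.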
|---|---|---|---|---|
| bialystok | 26.01 | 70.40 | 3.59 | 100.00 |
| bydgoszcz | 21.07 | 57.73 | 21.20 | 100.00 |
| czestochowa | 16.52 | 72.17 | 11.30 | 99.99 |
| gdansk | 38.86 | 51.71 | 9.43 | 100.00 |
| gdynia | 26.92 | 61.51 | 11.57 | 100.00 |
| katowice | 39.99 | 48.80 | 11.22 | 100.01 |
| krakow | 41.35 | 46.17 | 12.48 | 100.00 |
| lodz | 45.86 | 45.67 | 8.46 | 99.99 |
| lublin | 19.30 | 67.25 | 13.45 | 100.00 |
| poznan | 31.75 | 56.52 | 11.73 | 100.00 |
| radom | 20.00 | 72.17 | 7.83 | 100.00 |
| rzeszow | 28.71 | 66.13 | 5.16 | 100.00 |
| szczecin | 27.96 | 59.67 | 12.38 | 100.01 |
| warszawa | 49.34 | 36.02 | 14.64 | 100.00 |
| wroclaw | 45.61 | 45.07 | 9.33 | 100.01 |
The situation is somewhat different for the rental market. The share of offers in apartment buildings is higher, with a corresponding decline in the percentage of blocks of flats. The share of offers in tenements is also somewhat lower in most cities.
| type | mean | sd | median | min | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|
| apartmentBuilding | 16835.35 | 4883.377 | 16614.36 | 4137.931 | 32096.77 | 0.3299874 | 2.672051 |
| blockOfFlats | 12820.99 | 4233.305 | 12531.38 | 3435.864 | 30935.48 | 0.5797487 | 3.329609 |
| tenement | 13848.69 | 5904.646 | 12983.91 | 3000.000 | 31041.67 | 0.4951448 | 2.487640 |
Housing in apartment buildings is the most expensive in terms of price per square meter. In the purchase data set, its mean price is 31.3% higher than for blocks of flats and 21.6% higher than for tenements (16,835 PLN/m² versus 12,821 and 13,849 PLN/m²). Blocks of flats have the smallest standard deviation in both markets, which can be explained by the uniform characteristics of this type of housing: locations and conditions are similar across different blocks of flats or apartment buildings, while for tenements there are old, cheaper buildings alongside a relatively small number of new ones in perfect condition and location, so the deviation is significantly larger. Positive skewness values indicate that the tail of the distribution is on the right, which means that the number of more expensive apartments fades more gradually than the number of cheaper ones. Kurtosis indicates quite peaked distributions for each category.
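Skewness and kurtosis can be computed from central moments, as in the Python sketch below. The report does not say which estimator was used; this version uses the plain moment ratios, with kurtosis in its non-excess form (a normal distribution scores 3), which is an assumption. The sample values are made up.

```python
from statistics import mean


def shape_moments(values):
    """Return (skewness, kurtosis) from central moments; kurtosis is non-excess."""
    n = len(values)
    center = mean(values)
    m2 = sum((value - center) ** 2 for value in values) / n
    m3 = sum((value - center) ** 3 for value in values) / n
    m4 = sum((value - center) ** 4 for value in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Made-up samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]

skew_sym, kurt_sym = shape_moments(symmetric)
skew_right, kurt_right = shape_moments(right_tailed)
print(round(skew_sym, 3), round(kurt_sym, 3))
print(round(skew_right, 3), round(kurt_right, 3))
```

A single large value on the right is enough to make the skewness clearly positive, which is the pattern described for the price distributions.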
| type | mean | sd | median | min | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|
| apartmentBuilding | 75.76773 | 23.22751 | 72.37607 | 11.30396 | 187.8378 | 1.1240927 | 5.142059 |
| blockOfFlats | 61.56362 | 19.15648 | 59.25926 | 11.29861 | 189.4737 | 0.9502302 | 5.420271 |
| tenement | 70.01260 | 24.04812 | 67.30769 | 11.29971 | 189.3146 | 0.7050550 | 3.895923 |
Apartments in apartment buildings have the highest average rent (75.77 PLN/m²) and median (72.38 PLN/m²), suggesting they tend to be more expensive. Blocks of flats have the lowest average (61.56 PLN/m²) and median (59.26 PLN/m²), meaning they are generally more affordable, while tenements are in between, as in the purchase data set. Standard deviation is highest for tenements (24.05 PLN/m²) and apartment buildings (23.23 PLN/m²), indicating a wider spread in rental prices. The maximum rental price is quite high for all categories (~189 PLN/m²), suggesting some luxury or premium properties significantly impact the upper range. Apartment buildings (5.14) and blocks of flats (5.42) have higher kurtosis, indicating a sharper peak with more extreme values. Tenements (3.90) have a more moderate kurtosis, meaning a more balanced distribution without extreme outliers.
This image presents density plots for the distribution of price per square meter in both the purchase and rental markets, categorized by housing type (apartment buildings, blocks of flats and tenements). Purchase prices are more spread out than rental prices, particularly for apartment buildings. Rental prices are relatively more concentrated, with fewer extreme values. Blocks of flats consistently show the lowest prices and least variation, making them the most affordable option. Apartment buildings are the most expensive, both in purchase and rental markets, with more high-priced outliers.
The majority of housing is built of brick. This material is used more frequently in apartment buildings and tenements than in blocks of flats, which are made of concrete slab in one third of cases. The trend is common to the purchase and rental markets.
Statistical tests are conducted to determine whether observed data patterns are due to chance or represent real effects. They help in making objective decisions, validating hypotheses, and drawing conclusions in research by measuring relationships, differences, or associations within data sets.
The main purpose of the statistical test we are going to perform is to determine which variables have a statistically significant impact on both rent and buy prices.
Based on Welch’s test, we obtained a p-value close to 0, so we reject the null hypothesis of equal means between populations in both the buy and rent data sets. This means that the city statistically significantly differentiates both rent and buy prices. However, an effect size of 0.86 implies that this differentiation has a moderate magnitude.
All amenities appear to statistically significantly differentiate buy price, but their effect sizes are small, ranging from -0.14 for a balcony to -0.57 for an elevator.
Similarly to buy prices, all amenities appear to statistically significantly differentiate rent price, but their effect sizes are small, ranging from -0.10 for a balcony to -0.31 for an elevator.
Finally, the condition of the flat also seems to differentiate price significantly in both the rent and buy data sets, but the effect has only moderate magnitude.
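For a two-level amenity, the comparison can be sketched in Python as Welch's t statistic with Welch-Satterthwaite degrees of freedom, plus Cohen's d as the effect size. The report does not state which effect-size measure it used, so Cohen's d is an assumption, as is the group order (without the amenity minus with it). The p-value step needs a Student's t distribution and is omitted; the prices are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Made-up prices in thousands of PLN per m2, for illustration only.
without_elevator = [10, 12, 11, 13]
with_elevator = [12, 14, 13, 15]

t, df = welch_t(without_elevator, with_elevator)
d = cohens_d(without_elevator, with_elevator)
print(round(t, 3), round(df, 1), round(d, 3))
```

With tens of thousands of listings, even a small standardized difference yields a p-value near zero, which is why the effect size is reported alongside the test.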
This project provides a comprehensive analysis of the real estate market, utilizing statistical methods and visualizations to uncover key trends and insights. The study examines data structure, performs data cleansing, and addresses outliers to ensure accuracy. Through descriptive analysis and summary statistics, the project identifies variations in market share across cities and explores price distribution patterns.
The following problems were explored among others:
A data-driven overview of real estate pricing trends.
Identification of key factors influencing property prices.
Insights into how housing types and amenities impact pricing.
Statistical validation of differences in pricing across various cities.
Certain cities (e.g., Gdańsk, Gdynia, Częstochowa) show significant missing data in the condition and build year attributes; however, there is no strong correlation between missing values and other features. The highest property prices are in Warsaw, Krakow, and Gdansk, while the cheapest cities are Czestochowa, Radom, and Bydgoszcz. Property prices follow a fat-tailed distribution, and high-priced properties are concentrated in city centers with dense points of interest (POIs). No strong correlation is found between price levels and proximity to public places, but the distances to different amenities are strongly correlated with one another.
Rental and purchase price distributions by city, amenities and condition differ significantly, as confirmed by Welch’s test (p-value ≈ 0). Rental prices are more uniform, except in Warsaw, where the median rent is over 30% higher than in other cities. A positive correlation exists between apartment size and price, with larger price increases in cities with higher property costs. Amenities generally increase housing prices, except for storage rooms, which correlate with lower prices; the effect sizes are small (e.g., -0.14 for balconies and -0.57 for elevators in the buy data set). Housing condition differentiates prices significantly, with a moderate effect size.
The study confirms that city location, property size, and amenities significantly influence real estate prices. While rental prices show more uniformity, purchase prices exhibit stronger variations based on property type and city. Warsaw dominates both markets, with modern buildings losing value over time. Statistical tests validate these findings, providing insights for investors, buyers, and policymakers.