Introduction

Report Objective

The purpose of this report is to analyze the dataset apartaments_pl_2024_06. To achieve this, the data will be cleaned, missing values will be filled, and an analysis with visualizations of the results will be conducted.

Dataset Overview

The dataset apartaments_pl_2024_06 contains information about the housing market in Poland’s largest cities. It focuses on property price, apartment size, proximity to urban infrastructure, and the year of construction.

Collected Data

The dataset includes the following variables:

Variables describing the location of the apartment:

id – A computer-generated unique identifier for each property

city – The city where the property is located

latitude – Geographic latitude

longitude – Geographic longitude

Variables that describe the price and ownership of the property:

price – The price of the property

ownership – Ownership status of the property

Variables that describe the type and size of the apartment:

type – The type of building: Block of flats, apartment building, tenement

buildingMaterial – The material used for constructing the building

condition – The current condition of the property

squareMeters – The total area of the property (in square meters)

rooms – The number of rooms in the property

floor – The floor on which the property is located

floorCount – The total number of floors in the building

Variables that describe the distance from crucial amenities within the city:

centreDistance – Distance from the city center

poiCount – The number of points of interest within a relevant proximity from the property

schoolDistance – Distance to the nearest school

clinicDistance – Distance to the nearest medical clinic

postOfficeDistance – Distance to the nearest post office

kindergartenDistance – Distance to the nearest kindergarten

restaurantDistance – Distance to the nearest restaurant

collegeDistance – Distance to the nearest college/university

pharmacyDistance – Distance to the nearest pharmacy

Variables that describe whether the property, or the building in which it is located, includes certain infrastructure:

hasParkingSpace – Whether the property includes a parking space (Yes/No)

hasBalcony – Whether the property has a balcony (Yes/No)

hasElevator – Whether the building has an elevator (Yes/No)

hasSecurity – Whether the property/building has security features (Yes/No)

hasStorageRoom – Whether the property includes a storage room (Yes/No)

Data Cleansing and Wrangling

Installation of needed libraries and data
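The setup chunk itself is not shown in the report. A minimal sketch of what it might look like follows. The package list is inferred from the functions called later in the report (with moments assumed as the source of skewness() and kurtosis()), and the file path is hypothetical.

```r
# Packages inferred from the functions used in this report (an assumption;
# the original setup chunk is not shown). "moments" is a guess for
# skewness()/kurtosis(); another package could have been used instead.
required_packages <- c("dplyr", "ggplot2", "naniar", "VIM",
                       "moments", "corrplot", "scales")

# Check which packages are available before attaching them
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Install first: ", paste(missing_packages, collapse = ", "))
} else {
  invisible(lapply(required_packages, library, character.only = TRUE))
}

# Hypothetical file path; adjust to where the CSV actually lives.
# Treating empty strings as NA is also an assumption about the raw file.
data_path <- "apartments_pl_2024_06.csv"
if (file.exists(data_path)) {
  apartments <- read.csv(data_path, na.strings = c("", "NA"))
}
```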

Visualizing the missing data using a missing-data map

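# Load naniar first: it provides vis_miss() and gg_miss_upset()
library(naniar)

# Missing-data map
vis_miss(apartments)

# UpSet plot of missing-value combinations (nsets = 3 limits it to three variables)
gg_miss_upset(apartments, nsets = 3)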

The resulting plots show that approximately 6.3% of all values in the dataset are missing. Most of the missing data comes from the variables condition (74% missing) and buildingMaterial (41% missing). Notable gaps also appear in type, floor, and buildYear, with 20%, 17%, and 16% of values missing, respectively.

The missing data in the dataset may stem from several factors. The high percentage of missing values in the condition and buildingMaterial variables suggests that such information is often not disclosed by property owners or in real estate listings. Similarly, gaps in type, floor, and buildYear may result from incomplete records for older properties or inconsistencies in how data is reported across different cities. Missing values in floor could also be attributed to properties listed as houses or ground-floor apartments, where this information is not relevant. In some cases, property details might be deliberately omitted by sellers to make listings more appealing, or left out due to a lack of accurate records. Additionally, data collection methods, such as web scraping from real estate portals, could contribute to missing values if certain details are not consistently provided across different platforms.
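The percentages quoted above can also be obtained numerically rather than read off the plot. The following is a minimal base R sketch on a made-up toy data frame (the values are illustrative, not taken from the dataset); the same two expressions could be applied to apartments.

```r
# Toy data frame with some missing values (illustrative values only)
toy <- data.frame(
  price = c(500000, 720000, NA, 910000),
  condition = c(NA, "premium", NA, NA),
  floor = c(2, NA, 1, 4)
)

# Share of missing values per column, in percent
pct_missing_by_column <- colMeans(is.na(toy)) * 100

# Share of missing values over all cells, in percent
pct_missing_overall <- mean(is.na(toy)) * 100

print(pct_missing_by_column)
print(pct_missing_overall)
```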

Filling missing data using hotdeck

# Using hotdeck to fill missing data
czyste <- hotdeck(apartments)

n_miss(czyste)  # Count of NA values: 0

## [1] 0

n_complete(czyste)  # Count of complete values: 1204056

## [1] 1204056

pct_miss(czyste)  # Percentage of NA values: [1] 0

## [1] 0
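To illustrate the idea behind hot-deck imputation, here is a simplified base R sketch of a random hot-deck: each missing value is replaced by a value drawn from the observed values of the same column. This is not the algorithm used by hotdeck() above, whose implementation is more elaborate (for example, it supports ordering and domain variables); the function name random_hotdeck and the toy data are hypothetical.

```r
# Simplified random hot-deck (illustration only, not the VIM implementation):
# every NA is replaced by a randomly chosen observed value from the same column.
random_hotdeck <- function(data) {
  for (col in names(data)) {
    is_missing <- is.na(data[[col]])
    donors <- data[[col]][!is_missing]
    if (any(is_missing) && length(donors) > 0) {
      donor_idx <- sample.int(length(donors), sum(is_missing), replace = TRUE)
      data[[col]][is_missing] <- donors[donor_idx]
    }
  }
  data
}

set.seed(42)  # make the random donor choice reproducible
toy <- data.frame(
  floor = c(1, NA, 3, 4, NA),
  condition = c("low", NA, "premium", "low", "premium"),
  stringsAsFactors = FALSE
)

toy_filled <- random_hotdeck(toy)
print(toy_filled)
print(sum(is.na(toy_filled)))
```

Note that VIM's hotdeck() by default (imp_var = TRUE) appends one logical indicator column per variable, with the suffix _imp. This would be consistent with the complete-value count of 1,204,056 reported above, which equals 21,501 observations × 56 columns, if the raw data has 28 columns.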

The core of the analysis

The key variable to analyse in the dataset is price, as it most directly describes the state of the Polish housing market that the dataset covers.

# Format a number as whole PLN with "." as the thousands separator
fmt_pln <- function(x) {
  format(round(x), big.mark = ".", decimal.mark = ",", scientific = FALSE)
}

# Descriptive statistics for price
descriptive_stats <- czyste %>%
  select(price) %>%
  summarise(
    Mean = fmt_pln(mean(price, na.rm = TRUE)),
    Median = fmt_pln(median(price, na.rm = TRUE)),
    Std_Dev = fmt_pln(sd(price, na.rm = TRUE)),
    Min = fmt_pln(min(price, na.rm = TRUE)),
    Max = fmt_pln(max(price, na.rm = TRUE))
  )

# Print the results
print(descriptive_stats)

##             N       Mean       SD      Min     Q1 Median     Q3   Max
## 1 price 21501   823867.9 431126.7   191000 549000 721824 965000 3e+06

skew_value <- skewness(czyste$price, na.rm = TRUE)
kurt_value <- kurtosis(czyste$price, na.rm = TRUE)

print(paste("Skewness: ", skew_value))

## [1] "Skewness:  1.76338927496389"

print(paste("Kurtosis: ", kurt_value))

## [1] "Kurtosis:  7.1206736064391"

The skewness of 1.76 indicates a right-skewed distribution: most properties are concentrated at lower prices, with a tail of high-priced outliers. The kurtosis of 7.12 indicates heavy tails, that is, more extreme values than a normal distribution would produce.
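For reference, the two measures can be written out from their moment definitions. The sketch below is a base R illustration with hypothetical function names; it computes the non-excess kurtosis (a normal distribution scores 3 on this scale), which may or may not be the convention of the package used above.

```r
# Moment-based skewness: third central moment divided by variance^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment divided by variance^2
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2
}

symmetric <- c(1, 2, 3, 4, 5)       # symmetric sample: skewness is zero
right_skewed <- c(1, 1, 1, 1, 10)   # one large value: skewness is positive

print(moment_skewness(symmetric))
print(moment_kurtosis(symmetric))
print(moment_skewness(right_skewed))
```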

To analyse the state of the Polish housing market, the apartments are grouped into price ranges:

limits <- c(0, 250000, 500000, 750000, 1000000, 1250000, Inf)

# Generate labels
labels <- c(
  "<250,000 PLN", 
  "250,000 - 499,999 PLN", 
  "500,000 - 749,999 PLN", 
  "750,000 - 999,999 PLN", 
  "1,000,000 - 1,249,999 PLN", 
  "≥1,250,000 PLN"
)

# Categorize apartment prices into bins
czyste$price_category <- cut(
  czyste$price, 
  breaks = limits, 
  labels = labels, 
  include.lowest = TRUE, 
  right = FALSE
)

# Create a binary (0/1) indicator column for each price category
for (i in seq_along(labels)) {
  czyste[[paste0("in_range_", i)]] <- as.integer(czyste$price_category == labels[i])
}

# Count occurrences and calculate percentages
price_counts <- czyste %>%
  count(price_category) %>%
  mutate(percentage = n / sum(n) * 100)  # Convert counts to percentages

# Reference height 5% above the tallest bar; its distance to the tallest bar
# is used below as the upward offset for the percentage labels
fixed_label_height <- max(price_counts$n) * 1.05

# Plot the distribution of apartment prices
ggplot(price_counts, aes(x = price_category, y = n, fill = price_category)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = 2, 
            size = 5, 
            color = "black",
            nudge_y = fixed_label_height - max(price_counts$n)) +  # Shift each label up by the offset
  scale_y_continuous(labels = scales::comma) +  # Format Y-axis with thousands separator
  theme_minimal() +
  labs(
    title = "Distribution of Apartment Prices",
    x = "Price Range",
    y = "Count of Apartments"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate labels for readability
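The binning above relies on half-open intervals: with right = FALSE, each range includes its lower limit and excludes its upper limit, which is what the labels imply. A short sketch with made-up boundary prices shows where such values land; the limits and labels are copied from the code above, while example_prices is hypothetical.

```r
limits <- c(0, 250000, 500000, 750000, 1000000, 1250000, Inf)
labels <- c(
  "<250,000 PLN",
  "250,000 - 499,999 PLN",
  "500,000 - 749,999 PLN",
  "750,000 - 999,999 PLN",
  "1,000,000 - 1,249,999 PLN",
  "≥1,250,000 PLN"
)

# Hypothetical prices sitting on or next to the bin boundaries
example_prices <- c(249999, 250000, 999999, 1250000)

binned <- cut(
  example_prices,
  breaks = limits,
  labels = labels,
  include.lowest = TRUE,
  right = FALSE  # intervals are [lower, upper)
)

# Show each price next to the category it was assigned to
print(data.frame(price = example_prices, category = binned))
```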

The visualization shows how the apartments in the dataset are distributed across price ranges. The largest group falls in the 500,000 to 749,999 PLN range, with 33.9% of observations. The next two largest categories are 750,000 to 999,999 PLN at 24.7% and 250,000 to 499,999 PLN at 18.5%. Together, these three ranges account for 77.1% of observations.

The earlier hypothesis of a large number of high-priced outliers is supported: 13.2% of properties are priced at 1,250,000 PLN or more. This share alone, however, does not explain the high kurtosis; there must also be extreme outliers within this top price range. To check this, a box plot of price by city is created.

# Visualisation of price distribution by city
ggplot(czyste, aes(x = city, y = price)) +  
  geom_boxplot(aes(fill = city), alpha = 0.85) +  
  theme_minimal() +  
  labs(title = "Price Distribution by City", x = NULL, y = "Price") +   
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +  
  scale_y_continuous(labels = scales::label_number(big.mark = ".", decimal.mark = ","))

# Count occurrences and calculate percentages for the 'ownership' variable
ownership_counts <- czyste %>%
  count(ownership) %>%
  mutate(percentage = n / sum(n) * 100)  # Convert counts to percentages

# Reference height 5% above the tallest bar; its distance to the tallest bar
# is used below as the upward offset for the percentage labels
fixed_label_height <- max(ownership_counts$n) * 1.05

# Plot the distribution of ownership
ggplot(ownership_counts, aes(x = ownership, y = n, fill = ownership)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = 2, 
            size = 5, 
            color = "black",
            nudge_y = fixed_label_height - max(ownership_counts$n)) +  # Shift each label up by the offset
  scale_y_continuous(labels = scales::comma) +  # Format Y-axis with thousands separator
  theme_minimal() +
  labs(
    title = "Distribution of Ownership",
    x = "Ownership",
    y = "Count of Apartments"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate labels for readability

89.7% condominium ownership: The dominance of condominiums suggests that most apartments in the dataset are held as individual units with a separate title to the property. This is consistent with a market-oriented housing structure in which private ownership is emphasized and owners typically have full control over their units.

10.3% cooperative ownership: A much smaller share of apartments is held under cooperative ownership. Cooperative ownership is often more tightly regulated and may limit certain freedoms, such as the ability to sell or rent the unit freely.

Near 0% “udział” (share) ownership: The negligible presence of the “udział” category suggests that this form of ownership is rare, or possibly declining, in the Polish housing market. This might indicate a shift away from older, collective ownership structures.

# Two-way ANOVA: effects of city, condition, and their interaction on price
# (the model is stored under its own name so it does not mask the aov() function)
price_aov <- aov(price ~ city * condition, data = czyste)
summary(price_aov)

##                   Df    Sum Sq   Mean Sq F value Pr(>F)    
## city              14 1.119e+15 7.995e+13 599.974 <2e-16 ***
## condition          1 1.372e+13 1.372e+13 102.990 <2e-16 ***
## city:condition    14 2.121e+12 1.515e+11   1.137  0.319    
## Residuals      21471 2.861e+15 1.333e+11                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
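In the table above, the city and condition rows are both significant, while the city:condition row (p = 0.319) gives no evidence that the effect of condition differs between cities. A small simulated example shows how such a table is produced and how its rows can be accessed; the cities A to C, the effect sizes, and the noise level are all made up for illustration.

```r
set.seed(1)  # reproducible simulated data

# Balanced toy design: 3 hypothetical cities x 2 conditions x 20 apartments
toy <- expand.grid(
  city = c("A", "B", "C"),
  condition = c("low", "premium"),
  rep = 1:20
)

# Made-up additive effects (no interaction is built into the data)
city_effect <- c(A = 0, B = 100, C = 250)
condition_effect <- c(low = 0, premium = 40)

toy$price <- 500 +
  city_effect[as.character(toy$city)] +
  condition_effect[as.character(toy$condition)] +
  rnorm(nrow(toy), sd = 30)

# Same model structure as in the report: main effects plus interaction
toy_model <- aov(price ~ city * condition, data = toy)
anova_table <- summary(toy_model)[[1]]
print(anova_table)

# Row names are padded with spaces, so trim them before matching
term_names <- trimws(rownames(anova_table))
interaction_p <- anova_table[["Pr(>F)"]][term_names == "city:condition"]
print(interaction_p)
```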

# Ensure 'city' is treated as a factor (if not already)
czyste$city <- factor(czyste$city)

# Interaction plot: cities on the x-axis, one line per condition
interaction.plot(
  czyste$city,
  czyste$condition,
  czyste$price,
  main = "Interaction Plot: City x Condition",
  xlab = "City",
  ylab = "Average Price",
  col = seq_along(unique(czyste$condition)), # One color per condition (the trace factor)
  las = 2, # Rotate labels on x-axis for better readability
  cex.axis = 0.7, # Adjust axis label size if needed
  cex.lab = 1.2 # Adjust label size if needed
)

The graph displays the average apartment price in each city, broken down by the condition of the apartment.

Heatmap Analysis

# Compute correlation matrix for selected variables
correlation_matrix <- cor(czyste %>% select(price, poiCount, squareMeters, rooms), 
                          use = "pairwise.complete.obs")

# Plot correlation heatmap for key apartment features
corrplot(correlation_matrix, method = "square", 
         tl.cex = 0.8, cl.cex = 0.8, 
         type = "upper", order = "hclust", 
         addCoef.col = "black", number.cex = 0.7,
         title = "Correlation Heatmap for Selected Variables", mar = c(0, 0, 2, 0))

# Define relevant POI variables
poi_vars <- c("schoolDistance", "clinicDistance", "postOfficeDistance", 
              "kindergartenDistance", "restaurantDistance", "collegeDistance", 
              "pharmacyDistance", "centreDistance", "price")

# Ensure all selected variables exist in the dataset
existing_poi_vars <- poi_vars[poi_vars %in% colnames(czyste)]

# Compute correlation matrix for POI distances and price
poi_correlation <- cor(czyste %>% select(all_of(existing_poi_vars)), 
                       use = "pairwise.complete.obs")

# Plot POI correlation heatmap
corrplot(poi_correlation, method = "square", 
         tl.cex = 0.8, cl.cex = 0.8, 
         type = "upper", order = "hclust", 
         addCoef.col = "black", number.cex = 0.7,
         title = "POI Distance Correlations with Price", mar = c(0, 0, 2, 0))

# Define correlation analysis function
correlation_analysis <- function(data) {
  required_columns <- c("price", "poiCount", "squareMeters", "rooms")
  
  # Ensure required columns exist
  existing_columns <- required_columns[required_columns %in% names(data)]
  
  if (length(existing_columns) < length(required_columns)) {
    stop("One or more required columns are missing from the dataset.")
  }
  
  # Compute correlation values
  cor_matrix <- cor(data[, existing_columns], use = "pairwise.complete.obs")
  
  # Create correlation heatmap (mar leaves room so the title is not clipped)
  corrplot(cor_matrix, 
           method = "color", 
           col = colorRampPalette(c("orange", "white", "navy"))(200), 
           type = "upper", 
           order = "hclust", 
           addCoef.col = "black", 
           tl.col = "black", 
           tl.srt = 45, 
           title = "Correlation Heatmap",
           mar = c(0, 0, 2, 0))
}

# Execute correlation analysis
correlation_analysis(czyste)

The heat map analysis of correlations between variables results in three graphs.

1. The first graph displays the correlations between selected variables such as latitude, longitude, price, squareMeters, rooms, floor, floorCount, buildYear, and centreDistance, and shows little to no correlation between most of them.

2. The second heat map displays the correlations between price and the distances to different amenities. It shows that an apartment that is close to one amenity is typically also close to the others. More interesting is the negative correlation between price and both clinicDistance and restaurantDistance, which indicates that the smaller the distance, the higher the price of the apartment. This is especially unusual given the positive correlation between price and centreDistance, which indicates that the closer an apartment is to the center, the lower its price. This result goes against the intuitive hypothesis that apartments are most expensive in the city center. It might mean that the data contains many suburban apartments that are worth more than urban ones.

3. The third graph displays a heat map that measures price relative to the number of rooms and the size of the apartment. The natural hypothesis that larger apartments with more rooms are more expensive holds true, with a very high positive correlation.
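Because the correlations above are computed over all cities pooled together, one possible follow-up check is to compute the price and centreDistance correlation separately within each city. The sketch below uses made-up numbers for two hypothetical cities and is only meant to show the mechanics; it makes no claim about what the real data would show.

```r
# Made-up data: within each hypothetical city, price falls with distance,
# but city B has both larger distances and a higher overall price level
toy <- data.frame(
  city = rep(c("A", "B"), each = 5),
  centreDistance = c(1:5, 6:10),
  price = c(500 - 10 * (1:5), 900 - 10 * (6:10))
)

# Correlation over the pooled data
pooled_cor <- cor(toy$centreDistance, toy$price)

# Correlation within each city
within_cor <- sapply(
  split(toy, toy$city),
  function(d) cor(d$centreDistance, d$price)
)

print(pooled_cor)
print(within_cor)
```

In this toy example the pooled correlation is positive even though the within-city correlations are negative, which is why a per-city breakdown can be informative before interpreting a pooled coefficient.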

Next, a histogram is created to show which floors the apartments are located on, as a first step in analysing the influence of floor on price.

ggplot(czyste, aes(x = floor)) +
  geom_histogram(binwidth = 1, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(min(czyste$floor, na.rm = TRUE), 
                                  max(czyste$floor, na.rm = TRUE), 
                                  by = 1)) +
  labs(title = "Distribution of Apartments by Floor", x = "Floor", y = "Count")

The graph shows that most apartments are on the lower floors and suggests that buildings with fewer than 5 floors are the most common, so most of the buildings are rather small.

The floor analysis can be better understood through analysis of the building types within the population

# Calculate the count and percentage of each building type
type_summary <- czyste %>%
  count(type) %>%
  mutate(percentage = n / sum(n) * 100)

# Create a bar plot to show the count and percentage of each building type with fixed bar height
ggplot(type_summary, aes(x = type, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black", width = 0.7) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = 0.5, hjust = 0.5, size = 5, color = "black") +  # Center the percentage
  theme_minimal() +
  labs(title = "Distribution of Building Types", x = "Building Type", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate the average price for each building type
average_price_by_type <- czyste %>%
  group_by(type) %>%
  summarise(avg_price = mean(price, na.rm = TRUE))

Blocks of flats dominate at 62.5%, followed by apartment buildings at 20.2% and tenements at 17.3%. The significant share of tenements, which are typically small and low-rise, contributes to the apartments in the population being situated mostly on the lower floors.

Next, a graph is created to display the effect of floor on price.

# Create scatter plot with a linear trend line
ggplot(czyste, aes(x = floor, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +  # Scatter plot points
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Linear trend line
  labs(
    title = "Price vs. Floor",
    x = "Floor",
    y = "Price (PLN)"
  ) +
  theme_minimal()

The graph shows only a weak positive relationship between price and floor, suggesting that the floor on which an apartment is located does not affect its price substantially.
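The red line drawn by geom_smooth(method = "lm") is an ordinary least squares fit, so the strength of the trend can also be quantified with lm(). The sketch below uses made-up floor and price values, chosen so that the slope is positive but explains little of the variation; on the real data the same calls would take czyste as the data argument.

```r
# Made-up data: a small upward trend in price by floor, swamped by large noise
toy <- data.frame(floor = 0:9)
noise <- rep(c(-50000, 50000), times = 5)
toy$price <- 500000 + 2000 * toy$floor + noise

# Same model as the trend line in the scatter plot
fit <- lm(price ~ floor, data = toy)

slope <- unname(coef(fit)["floor"])   # estimated price change per floor
r_squared <- summary(fit)$r.squared   # share of price variance explained by floor

print(slope)
print(r_squared)
```

A positive slope combined with a low R-squared is the numeric counterpart of a weak positive relationship in the plot.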

Next, building material and security infrastructure are analysed to determine their effect on the price of the apartment.

ggplot(czyste, aes(x = buildingMaterial, y = price, fill = buildingMaterial)) +
  geom_boxplot(alpha = 0.8) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Impact of Building Material on Price", x = "Building Material", y = "Price")

#Impact of Security on Apartment Prices
ggplot(czyste, aes(x = factor(hasSecurity, labels = c("No Security", "Has Security")), y = price)) +
  geom_boxplot(aes(fill = factor(hasSecurity)), alpha = 0.8) +
  theme_minimal() +
  labs(title = "Impact of Security on Apartment Prices", x = "Security", y = "Price") +
  scale_fill_manual(values = c("red", "green"))

The data shows that apartments made of brick are generally more expensive than those made of concrete, and that apartments with security are generally more expensive than those without it. In both cases, however, there is a great number of outliers in the population, meaning that neither variable is strongly related to price. Notably for the analysis, the density of outliers among apartments without security suggests that there are very expensive apartments that do not have security.

ggplot(czyste, aes(x = centreDistance, y = price)) +
  geom_point(aes(color = city), alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme_minimal() +
  labs(title = "Effect of Distance from Center on Price", x = "Distance from City Center (km)", y = "Price")

# Visualizations  
ggplot(czyste, aes(x = squareMeters, y = price)) +  
  geom_point(aes(color = city, shape = ownership), size = 2) +  
  theme_minimal()

The first graph shows the relationship between price and distance from the city center across different Polish cities.

The second visualization displays price against apartment size, by type of ownership and city. The most significant conclusion is the high cost of Warsaw apartments and the low cost of Bydgoszcz apartments. As expected, the most expensive small apartments are located in the Polish capital, while the largest inexpensive apartments are found in many Polish cities, most notably Bydgoszcz, Katowice, and Lodz.

Findings

General data

This analysis of the Polish real estate market uses data from June 2024. Most apartments, 77.1%, fall in the price range from 250,000 to 1,000,000 PLN. The price distribution is strongly right-skewed, indicating a tail of high-priced apartments. Over 13% of properties are priced at 1,250,000 PLN or more, confirming the presence of high-end real estate.

The city where the apartment is located is the most significant factor for its price

Box plots show significant price variability between cities. Apartment condition affects price far less than city does, as the interaction plot suggests. The ANOVA of price by city and condition confirms this: both effects are statistically significant, but city accounts for a far larger sum of squares (1.119e+15) than condition (1.372e+13), and the city:condition interaction is not significant (p = 0.319).

City distance

The negative correlation between price and the distances to clinics and restaurants suggests that high-value apartments are located in areas with better urban infrastructure. An unexpected result is that apartments further from the center are more expensive, possibly due to the presence of luxury suburban properties in gated communities. The dataset may be skewed by high-end suburban developments rather than traditional urban apartments.

Final Conclusions

The Polish real estate market is dominated by mid-range apartments (500,000 - 999,999 PLN). Outliers exist in the luxury market and significantly affect statistical measures. Private ownership dominates (nearly 90%), with limited cooperative housing. Apartment condition has little impact on price, suggesting that location matters more. Proximity to clinics and restaurants is positively associated with price, while distance from the city center shows an unexpected positive correlation with price. This center-distance effect suggests that high-priced properties may be concentrated in suburban, high-end developments rather than in traditional city centers.

Report

Michał Łochowski, Paweł Górdak, Dawid Drawc

02.02.2025