1. Introduction

The real estate market in Baku, Azerbaijan, represents a dynamic and evolving sector, shaped by rapid urban development, diverse economic conditions, and shifting demographic patterns. With the growing demand for housing in the city, understanding the factors that influence apartment prices is crucial for urban planners, investors, and policymakers. This project aims to explore these complexities by using econometric analysis to identify and quantify the key drivers of apartment prices in Baku.

Specifically, this research will address four key questions:

1. How do proximity to key amenities (such as metro stations) and infrastructure impact apartment prices in Baku? This question explores the correlation between the accessibility of essential amenities and property values, offering insights into urban planning and development strategies.

2. What is the effect of floor level on the pricing of apartments within high-rise buildings in Baku’s urban landscape? This will investigate whether higher floors in high-rise buildings command a premium, potentially due to better views, less noise, or other perceived benefits.

3. Are there significant price differences between newly constructed and older apartments in Baku after accounting for location and square footage? This question examines the impact of a building’s age on its market value, considering how newer constructions with modern amenities might influence pricing.

4. Does the availability of mortgages increase the market value of apartments in Baku, and if so, by how much compared to properties without mortgage options? This will explore whether the option to finance apartments through mortgages plays a role in increasing their market value in Baku.

Through this analysis, we seek to uncover the roles of location, amenities, economic conditions, and other relevant variables that contribute to the fluctuation of property values in Baku. By leveraging data sourced from Kaggle and scraping information from the local real estate platform Bina.az, the project will employ advanced statistical methods to generate actionable insights.

The outcome of this research will not only enhance our understanding of the Baku housing market but also provide critical policy recommendations for more informed urban planning and development strategies. Ultimately, this project aims to offer a comprehensive model that can guide future real estate investments and improve living conditions for residents in the city.

The variables are:

1. Price: The listed price of the property, offering insight into market trends and pricing variation.

2. Location: The geographical label of the property, given as a district, settlement, or nearest metro station. Location is a critical factor in real estate decision-making.

3. Rooms: The number of rooms in the property. Room count helps prospective buyers or renters assess whether the property suits their needs.

4. Square: The total area of the property in square meters. Property size is an essential factor for assessing space and value.

5. Floor: The floor on which the property is situated, recorded together with the total number of floors in the building (for example, "5/12").

6. New Building: A binary value (0 or 1) indicating whether the property is in a newly constructed building. This information is valuable for those seeking modern or recently built properties.

7. Has Repair: A binary value indicating whether the property has undergone repairs or renovations. Repair status can influence a buyer's decision.

8. Has Bill of Sale: A binary value indicating whether a legal bill of sale exists for the property, which supports the legitimacy of the transaction.

9. Has Mortgage: A binary value indicating whether the property has an existing mortgage.

2. Data Importing

baku_apartment_data <- read.csv("C:/Users/User/Desktop/Applied Regional and Urban Economics/Final Project/BakuApartmentData.csv")
# View the first few rows of the dataset
head(baku_apartment_data)
##   X  price              location rooms square floor new_building has_repair
## 1 0 284000  Azadlıq Prospekti m.     3    140  5/12            1          1
## 2 1 355000  Şah İsmayıl Xətai m.     3    135 19/20            1          1
## 3 2 755000             Səbail r.     4    210  7/18            1          1
## 4 3 245000 Elmlər Akademiyası m.     3     86  8/10            1          1
## 5 4 350000 Elmlər Akademiyası m.     4    174 12/15            1          1
## 6 5 255000             Nizami m.     2     93 10/16            1          1
##   has_bill_of_sale has_mortgage
## 1                1            1
## 2                1            1
## 3                1            1
## 4                1            1
## 5                1            1
## 6                1            1

The data were scraped from Bina.az during 2023.

Number of Columns and Rows

ncol(baku_apartment_data)
## [1] 10
nrow(baku_apartment_data)
## [1] 39302

Structure

str(baku_apartment_data)
## 'data.frame':    39302 obs. of  10 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ price           : int  284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
##  $ location        : chr  "Azadlıq Prospekti m." "Şah İsmayıl Xətai m." "Səbail r." "Elmlər Akademiyası m." ...
##  $ rooms           : int  3 3 4 3 4 2 3 3 2 3 ...
##  $ square          : num  140 135 210 86 174 93 133 130 63 123 ...
##  $ floor           : chr  "5/12" "19/20" "7/18" "8/10" ...
##  $ new_building    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_repair      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_bill_of_sale: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_mortgage    : int  1 1 1 1 1 1 1 1 1 1 ...

Summary

summary(baku_apartment_data)
##        X             price           location             rooms       
##  Min.   :    0   Min.   :   9600   Length:39302       Min.   : 1.000  
##  1st Qu.: 9825   1st Qu.: 135000   Class :character   1st Qu.: 2.000  
##  Median :19651   Median : 187000   Mode  :character   Median : 3.000  
##  Mean   :19651   Mean   : 232232                      Mean   : 2.814  
##  3rd Qu.:29476   3rd Qu.: 277000                      3rd Qu.: 3.000  
##  Max.   :39301   Max.   :8075000                      Max.   :20.000  
##      square        floor            new_building      has_repair   
##  Min.   :  12   Length:39302       Min.   :0.0000   Min.   :0.000  
##  1st Qu.:  65   Class :character   1st Qu.:1.0000   1st Qu.:1.000  
##  Median :  94   Mode  :character   Median :1.0000   Median :1.000  
##  Mean   : 106                      Mean   :0.7559   Mean   :0.839  
##  3rd Qu.: 130                      3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :1600                      Max.   :1.0000   Max.   :1.000  
##  has_bill_of_sale  has_mortgage   
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000  
##  Mean   :0.7683   Mean   :0.3379  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

3. Data Preprocessing

Creating two columns from the floor column

# dplyr provides %>%, select() and rename(), which are used below
library(dplyr)

# Split the 'floor' column into two new columns
baku_apartment_data$floor_split <- strsplit(as.character(baku_apartment_data$floor), "/")

# Create 'current_floor' and 'total_floors' columns from the split data
baku_apartment_data$current_floor <- sapply(baku_apartment_data$floor_split, function(x) x[1])
baku_apartment_data$total_floors <- sapply(baku_apartment_data$floor_split, function(x) x[2])

# Optionally, convert these new columns to integer type if necessary
baku_apartment_data$current_floor <- as.integer(baku_apartment_data$current_floor)
baku_apartment_data$total_floors <- as.integer(baku_apartment_data$total_floors)



baku_apartment_data <- baku_apartment_data %>% 
  select(-floor, -floor_split) %>% 
  rename(floor = current_floor)
# View the updated dataframe to check the new columns
head(baku_apartment_data)
##   X  price              location rooms square new_building has_repair
## 1 0 284000  Azadlıq Prospekti m.     3    140            1          1
## 2 1 355000  Şah İsmayıl Xətai m.     3    135            1          1
## 3 2 755000             Səbail r.     4    210            1          1
## 4 3 245000 Elmlər Akademiyası m.     3     86            1          1
## 5 4 350000 Elmlər Akademiyası m.     4    174            1          1
## 6 5 255000             Nizami m.     2     93            1          1
##   has_bill_of_sale has_mortgage floor total_floors
## 1                1            1     5           12
## 2                1            1    19           20
## 3                1            1     7           18
## 4                1            1     8           10
## 5                1            1    12           15
## 6                1            1    10           16
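The split above can also be written with an explicit format check, which guards against listings whose floor value is not of the form "current/total". The following is a minimal sketch on a few values copied from the head() output. The names `floor_raw` and `relative_floor` and the validation rules are illustrative assumptions, not part of the original pipeline.

```r
# Illustrative values taken from the first rows of the dataset
floor_raw <- c("5/12", "19/20", "7/18", "8/10")

# Check that every entry looks like "<digits>/<digits>" before splitting
well_formed <- grepl("^[0-9]+/[0-9]+$", floor_raw)
stopifnot(all(well_formed))

# Split once and pull out each part with a type-stable vapply()
floor_parts   <- strsplit(floor_raw, "/", fixed = TRUE)
current_floor <- as.integer(vapply(floor_parts, `[`, character(1), 1))
total_floors  <- as.integer(vapply(floor_parts, `[`, character(1), 2))

# Sanity check: an apartment cannot sit above the top floor
stopifnot(all(current_floor <= total_floors))

# Hypothetical derived feature: position of the apartment within the building
relative_floor <- current_floor / total_floors
print(data.frame(current_floor, total_floors, relative_floor))
```

A relative position such as `relative_floor` is one possible way to compare floors across buildings of different heights. Whether it is useful for the floor-level question is an empirical matter.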
#Load the required library
library(dplyr)

# Calculate the number of unique values for each column in the dataset
unique_counts <- sapply(baku_apartment_data, function(x) length(unique(x)))

# Print the number of unique values for each column
print(unique_counts)
##                X            price         location            rooms 
##            39302             1845              111               16 
##           square     new_building       has_repair has_bill_of_sale 
##             1100                2                2                2 
##     has_mortgage            floor     total_floors 
##                2               27               33
# Converting binary categorical variables to factors
baku_apartment_data$new_building <- factor(baku_apartment_data$new_building)
baku_apartment_data$has_repair <- factor(baku_apartment_data$has_repair)
baku_apartment_data$has_bill_of_sale <- factor(baku_apartment_data$has_bill_of_sale)
baku_apartment_data$has_mortgage <- factor(baku_apartment_data$has_mortgage)

# Optionally converting location if treated as categorical
baku_apartment_data$location <- factor(baku_apartment_data$location)

baku_apartment_data <- baku_apartment_data %>% 
  select(-X)
# Checking the structure to verify changes
str(baku_apartment_data)
## 'data.frame':    39302 obs. of  10 variables:
##  $ price           : int  284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
##  $ location        : Factor w/ 111 levels "1-ci mikrorayon q.",..: 18 86 91 38 38 73 16 48 80 43 ...
##  $ rooms           : int  3 3 4 3 4 2 3 3 2 3 ...
##  $ square          : num  140 135 210 86 174 93 133 130 63 123 ...
##  $ new_building    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_repair      : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_bill_of_sale: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_mortgage    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ floor           : int  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors    : int  12 20 18 10 15 16 6 18 16 17 ...

The region column is created from the location variable using the suffix of the location name: a name ending in "m." indicates that the apartment is close to a metro station, a name ending in "q." indicates a suburban area, and a name ending in "r." indicates a district.

# Load necessary library
library(dplyr)

# Define a function to categorize locations
categorize_location <- function(location) {
  if (grepl("m\\.$", location)) {
    "Close to Metro"
  } else if (grepl("q\\.$", location)) {
    "Suburban Area"
  } else if (grepl("r\\.$", location)) {
    "Residential District"
  } else {
    "Central Baku"
  }
}

# Apply the function to create a new column
baku_apartment_data$region <- sapply(baku_apartment_data$location, categorize_location)
baku_apartment_data$region <- factor(baku_apartment_data$region)
# Check the changes
table(baku_apartment_data$region)  # This will give you a summary count of each category
## 
##       Close to Metro Residential District        Suburban Area 
##                26825                 6364                 6113
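The suffix rule can also be applied to the whole column at once instead of element by element. The sketch below is an assumed alternative, not the original implementation. `categorize_location_vec` and `example_locations` are hypothetical names, and the "Central Baku" fallback is kept only to mirror the original function (the table above shows that no location in this dataset reaches it).

```r
# Vectorized version of the suffix rule; returns a factor
categorize_location_vec <- function(location) {
  location <- as.character(location)               # also accepts factors
  region <- rep("Central Baku", length(location))  # fallback, as in the original function
  region[endsWith(location, "m.")] <- "Close to Metro"
  region[endsWith(location, "q.")] <- "Suburban Area"
  region[endsWith(location, "r.")] <- "Residential District"
  factor(region)
}

# A few location labels from the dataset, used here only as an example
example_locations <- c("Nizami m.", "Səbail r.", "Ağ şəhər q.", "Nizami r.")
print(table(categorize_location_vec(example_locations)))
```

Because `endsWith()` operates on the full vector, this avoids calling a function once per row, which can matter on a dataset with tens of thousands of listings.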
# Extract unique locations from the 'location' column
unique_locations <- unique(baku_apartment_data$location)

# Print the unique location names
print(unique_locations)
##   [1] Azadlıq Prospekti m.  Şah İsmayıl Xətai m.  Səbail r.            
##   [4] Elmlər Akademiyası m. Nizami m.             Ağ şəhər q.          
##   [7] İnşaatçılar m.        Qara Qarayev m.       Həzi Aslanov m.      
##  [10] Yasamal r.            Əhmədli m.            Koroğlu m.           
##  [13] Memar Əcəmi m.        Gənclik m.            Nərimanov r.         
##  [16] Yeni Yasamal q.       8 Noyabr m.           20 Yanvar m.         
##  [19] Nəsimi m.             7-ci mikrorayon q.    28 May m.            
##  [22] Binəqədi r.           Bayıl q.              Avtovağzal m.        
##  [25] Nəriman Nərimanov m.  Xətai r.              Nəsimi r.            
##  [28] Binəqədi q.           9-cu mikrorayon q.    Əhmədli q.           
##  [31] Bakıxanov q.          Qaraçuxur q.          Neftçilər m.         
##  [34] Badamdar q.           Sabunçu r.            Həzi Aslanov q.      
##  [37] Xalqlar Dostluğu m.   Suraxanı r.           Nizami r.            
##  [40] Hövsan q.             Masazır q.            8-ci mikrorayon q.   
##  [43] 4-cü mikrorayon q.    İçəri Şəhər m.        Yeni Günəşli q.      
##  [46] Yasamal q.            Dərnəgül m.           1-ci mikrorayon q.   
##  [49] Məmmədli q.           Kubinka q.            Nardaran q.          
##  [52] Mehdiabad q.          Lökbatan q.           Sahil m.             
##  [55] Biləcəri q.           Köhnə Günəşli q.      Kürdəxanı q.         
##  [58] Bakmil m.             8-ci kilometr q.      Yeni Ramana q.       
##  [61] Ceyranbatan q.        Abşeron r.            Zığ q.               
##  [64] Buzovna q.            Çiçək q.              Massiv D q.          
##  [67] Ramana q.             Günəşli q.            Sahil q.             
##  [70] Zabrat q.             Massiv A q.           Saray q.             
##  [73] Sulutəpə q.           28 May q.             M.Ə.Rəsulzadə q.     
##  [76] Xəzər r.              Xocəsən m.            Ulduz m.             
##  [79] Şıxov q.              Xutor q.              Massiv V q.          
##  [82] 6-cı mikrorayon q.    Novxanı q.            NZS q.               
##  [85] Sabunçu q.            Maştağa q.            Qaradağ r.           
##  [88] Böyükşor q.           Bibi Heybət q.        Qala q.              
##  [91] 3-cü mikrorayon q.    Mərdəkan q.           Hökməli q.           
##  [94] Savalan q.            Digah q.              Xocəsən q.           
##  [97] Binə q.               Keşlə q.              5-ci mikrorayon q.   
## [100] Görədil q.            Pirəkəşkül q.         Massiv G q.          
## [103] Əmircan q.            Suraxanı q.           Massiv B q.          
## [106] Yeni Suraxanı q.      20-ci sahə q.         Ələt q.              
## [109] Pirallahı r.          Bülbülə q.            Şimal DRES q.        
## 111 Levels: 1-ci mikrorayon q. 20-ci sahə q. 20 Yanvar m. ... Zığ q.
library(ggplot2)

# Create a histogram of the price variable with adjusted x-axis limits
ggplot(baku_apartment_data, aes(x = price)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Histogram of Apartment Prices",
       x = "Price",
       y = "Frequency") +
  xlim(0, 2000000) +  # Adjust these limits based on your observation of where the data thins out
  theme_minimal()
## Warning: Removed 47 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

The histogram shows the distribution of apartment prices in Baku, with the majority of properties priced under 500,000. The frequency peaks at the lower end of the range, indicating that more affordable apartments dominate the listings. As the price increases, the frequency drops sharply, showing that higher-priced apartments are much less common. The 47 listings priced above the 2,000,000 axis limit are excluded from the plot, which points to a small number of very expensive outliers.

ggplot(baku_apartment_data, aes(y = price)) +
  geom_boxplot(fill = "tomato", color = "black") +
  labs(title = "Boxplot of Apartment Prices", y = "Price") +
  coord_cartesian(ylim = c(0, 2000000)) +  # Adjusts view but does not remove data
  theme_minimal()

The histogram and boxplot show a right-skewed price distribution. Half of the listings are priced at or below 187,000 and three quarters at or below 277,000, while the maximum reaches 8,075,000. The points above the upper whisker of the boxplot are high-priced outliers, likely luxury apartments or properties in prime locations. Because these values are asking prices from listings, they describe what is on offer rather than realized demand, and the extreme values pull the mean (232,232) well above the median.
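Right-skewed price distributions are often analysed on a logarithmic scale. The sketch below uses simulated log-normal prices (not the Bina.az data) to show how a log transform reduces skewness. `simulated_price` and `sample_skewness` are illustrative names, and this section does not state whether the project's models use log(price).

```r
set.seed(123)

# Simulated right-skewed prices; parameters chosen only for illustration
simulated_price <- rlnorm(5000, meanlog = 12, sdlog = 0.6)

# Sample skewness: third standardized moment
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(simulated_price)
skew_log <- sample_skewness(log(simulated_price))

cat("Skewness of price:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(price):", round(skew_log, 2), "\n")
```

A skewness near zero after the transform suggests that a model for log(price) would be less dominated by a handful of very expensive listings.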

# Count missing values per column
missing_counts <- sapply(baku_apartment_data, function(x) sum(is.na(x)))
print(missing_counts)
##            price         location            rooms           square 
##                0                0                0                0 
##     new_building       has_repair has_bill_of_sale     has_mortgage 
##                0                0                0                0 
##            floor     total_floors           region 
##                0                0                0
# Install and load VIM package if not already installed
if (!require(VIM)) install.packages("VIM", dependencies = TRUE)
## Loading required package: VIM
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(VIM)

# Visualize missing data
aggr_plot <- aggr(baku_apartment_data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,
                  labels=names(baku_apartment_data), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable Count
##             price     0
##          location     0
##             rooms     0
##            square     0
##      new_building     0
##        has_repair     0
##  has_bill_of_sale     0
##      has_mortgage     0
##             floor     0
##      total_floors     0
##            region     0

The missing-data plot confirms that the dataset is complete. Every variable, including "has_mortgage", has zero missing values, which matches the counts printed above, so no imputation or row deletion is needed before the analysis.

4. EDA

summary(baku_apartment_data)
##      price                         location         rooms            square    
##  Min.   :   9600   İnşaatçılar m.      : 2834   Min.   : 1.000   Min.   :  12  
##  1st Qu.: 135000   Nəriman Nərimanov m.: 2633   1st Qu.: 2.000   1st Qu.:  65  
##  Median : 187000   Nəsimi r.           : 2254   Median : 3.000   Median :  94  
##  Mean   : 232232   Şah İsmayıl Xətai m.: 2171   Mean   : 2.814   Mean   : 106  
##  3rd Qu.: 277000   Həzi Aslanov m.     : 2113   3rd Qu.: 3.000   3rd Qu.: 130  
##  Max.   :8075000   Memar Əcəmi m.      : 1981   Max.   :20.000   Max.   :1600  
##                    (Other)             :25316                                  
##  new_building has_repair has_bill_of_sale has_mortgage     floor       
##  0: 9594      0: 6327    0: 9108          0:26020      Min.   : 1.000  
##  1:29708      1:32975    1:30194          1:13282      1st Qu.: 4.000  
##                                                        Median : 7.000  
##                                                        Mean   : 8.198  
##                                                        3rd Qu.:12.000  
##                                                        Max.   :27.000  
##                                                                        
##   total_floors                    region     
##  Min.   : 1.00   Close to Metro      :26825  
##  1st Qu.: 9.00   Residential District: 6364  
##  Median :16.00   Suburban Area       : 6113  
##  Mean   :13.86                               
##  3rd Qu.:18.00                               
##  Max.   :34.00                               
## 
# Assuming 'baku_apartment_data' is your dataframe

# Select only numeric variables
numeric_data <- baku_apartment_data[sapply(baku_apartment_data, is.numeric)]

# Compute the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")  # Handling missing values by excluding them
print(cor_matrix)
##                  price     rooms    square     floor total_floors
## price        1.0000000 0.5986930 0.8081653 0.1586334    0.2610726
## rooms        0.5986930 1.0000000 0.7711708 0.1110677    0.1226077
## square       0.8081653 0.7711708 1.0000000 0.2340331    0.3263950
## floor        0.1586334 0.1110677 0.2340331 1.0000000    0.6244251
## total_floors 0.2610726 0.1226077 0.3263950 0.6244251    1.0000000
# Install and load the corrplot package if not already installed
if (!require(corrplot)) install.packages("corrplot")
## Loading required package: corrplot
## corrplot 0.94 loaded
library(corrplot)

# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black", 
         number.cex = 0.7, cl.cex = 0.8)

There is a strong positive correlation (0.77) between the area of the apartment in square meters and its number of rooms, which might cause a multicollinearity problem in the model. We keep both variables at this stage, re-check multicollinearity with variance inflation factor (VIF) scores after fitting the model, and make the final decision then.
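For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Packages such as car provide a `vif()` function for fitted models. The sketch below computes the same quantity in base R on simulated data, so the numbers are not results for the Baku dataset, and `vif_manual` is a hypothetical helper.

```r
set.seed(42)
n <- 500

# Simulated predictors: rooms is constructed to correlate with square
square <- rnorm(n, mean = 100, sd = 40)
rooms  <- 1 + 0.02 * square + rnorm(n, sd = 0.5)
floor  <- sample(1:20, n, replace = TRUE)
predictors <- data.frame(square, rooms, floor)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- function(df) {
  vapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic collinearity, although the threshold is a convention rather than a formal test.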

library(ggplot2)

# Identify numeric columns
numeric_columns <- sapply(baku_apartment_data, is.numeric)

# Loop through numeric columns to plot histograms
for (col_name in names(baku_apartment_data)[numeric_columns]) {
  print(ggplot(baku_apartment_data, aes(x = .data[[col_name]])) +  # .data pronoun replaces deprecated aes_string()
          geom_histogram(bins = 30, fill = "blue", color = "black") +
          ggtitle(paste("Histogram of", col_name)) +
          xlab(col_name) +
          ylab("Frequency") +
          theme_minimal())
}
The last of these histograms shows the distribution of the "total_floors" variable. The distribution has several peaks at specific floor counts, indicating that buildings of certain heights are especially common among the listings. The summary statistics place the median at 16 floors with an interquartile range of 9 to 18, so at least half of the listed apartments are in buildings with 16 or more floors, and about a quarter are in buildings with 9 floors or fewer. Fewer apartments are listed in buildings with very few or very many floors, as indicated by the low frequencies in these ranges.

# Identify categorical columns
categorical_columns <- sapply(baku_apartment_data, function(x) is.factor(x) || is.character(x))

# Loop through categorical columns to plot bar plots
for (col_name in names(baku_apartment_data)[categorical_columns]) {
  print(ggplot(baku_apartment_data, aes(x = .data[[col_name]])) +  # .data pronoun replaces deprecated aes_string()
          geom_bar(fill = "tomato", color = "black") +
          ggtitle(paste("Bar Plot of", col_name)) +
          xlab(col_name) +
          ylab("Count") +
          theme_minimal() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))) # Rotate x-axis labels for readability
}

The bar plot for region shows how the listings are distributed across the three location categories. A large majority, 26,825 of 39,302 listings (about 68%), fall in the Close to Metro category, while the Residential District and Suburban Area categories have 6,364 and 6,113 listings respectively. This suggests that listings in Baku are heavily concentrated around metro-accessible locations, which may reflect the demand for and convenience of living near public transportation. Note that the category is inferred from the suffix of the location label rather than from a measured distance to a station.

library(dplyr)
library(ggplot2)

# Count the frequency of each location, sort it, and select the top 10
top_locations <- baku_apartment_data %>%
  count(location) %>%
  arrange(desc(n)) %>%
  top_n(10, n)

# View the top locations
print(top_locations)
##                 location    n
## 1         İnşaatçılar m. 2834
## 2   Nəriman Nərimanov m. 2633
## 3              Nəsimi r. 2254
## 4   Şah İsmayıl Xətai m. 2171
## 5        Həzi Aslanov m. 2113
## 6         Memar Əcəmi m. 1981
## 7              28 May m. 1696
## 8  Elmlər Akademiyası m. 1675
## 9           Nərimanov r. 1587
## 10           8 Noyabr m. 1408
# Filter the main dataset to include only the top 10 locations
filtered_data <- baku_apartment_data %>%
  filter(location %in% top_locations$location)
# Plotting the bar plot for the top 10 locations
ggplot(filtered_data, aes(x = location)) +
  geom_bar(fill = "tomato", color = "black") +
  ggtitle("Top 10 Most Frequent Locations") +
  xlab("Location") +
  ylab("Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

This bar plot shows the ten locations with the most apartment listings. İnşaatçılar m. (2,834), Nəriman Nərimanov m. (2,633), and Nəsimi r. (2,254) have the highest counts, followed by Şah İsmayıl Xətai m., Həzi Aslanov m., and Memar Əcəmi m. The remaining four, 28 May m., Elmlər Akademiyası m., Nərimanov r., and 8 Noyabr m., each have between roughly 1,400 and 1,700 listings. Eight of the ten locations are metro stations, which is consistent with the concentration of listings in well-connected urban areas.
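ggplot2 draws a discrete axis in the order of the factor levels, and factor levels default to alphabetical order, so a bar plot of locations does not appear sorted by count unless the factor is re-levelled. The sketch below shows one way to order levels by frequency in base R. The toy vector and its counts are illustrative only, and `location_ordered` is a hypothetical name.

```r
# Toy location vector; counts are illustrative only
location <- c(rep("İnşaatçılar m.", 5), rep("28 May m.", 2), rep("Nəsimi r.", 3))

# Count listings per location and sort from most to least frequent
location_counts <- sort(table(location), decreasing = TRUE)

# Re-level the factor so its levels follow the frequency order
location_ordered <- factor(location, levels = names(location_counts))
print(levels(location_ordered))
```

Passing a factor ordered this way to `aes(x = ...)` would place the most frequent location first, which makes the ranking readable directly from the plot.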

library(dplyr)
library(ggplot2)

# Recategorize the 'rooms' variable
baku_apartment_data_room <- baku_apartment_data %>%
  mutate(rooms_cat = case_when(
    rooms == 1 ~ "1",
    rooms == 2 ~ "2",
    rooms == 3 ~ "3",
    rooms == 4 ~ "4",
    rooms >= 5 ~ "5+",
    TRUE ~ as.character(rooms)  # Handles any data integrity issues
  ))

# Convert rooms_cat to a factor for better plotting
baku_apartment_data_room$rooms_cat <- as.factor(baku_apartment_data_room$rooms_cat)




library(ggplot2)
library(scales)  # Ensure scales is loaded for formatting
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Plot with black background and adjusted color settings
ggplot(baku_apartment_data_room, aes(x = square, y = price, color = rooms_cat)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_d(end = 0.9) +  # Adjust color scale for visibility on dark background
  scale_x_continuous(breaks = seq(0, 500, by = 50),  # Adjust x-axis by 50 units
                     limits = c(0, 500),  # Limit x-axis to 500
                     labels = comma) +
  scale_y_continuous(breaks = seq(0, 1e6, by = 100000),  # Adjust y-axis by 100k units
                     limits = c(0, 1e6),  # Limit y-axis to 1 million
                     labels = comma) +
  labs(title = "Price vs. Area by Number of Rooms",
       x = "Area (square meters)",
       y = "Price",
       color = "Number of Rooms") +
  theme_minimal(base_family = "Helvetica", base_size = 14) +
  theme(
    plot.background = element_rect(fill = "black", color = "black"),
    panel.background = element_rect(fill = "black"),
    text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.ticks = element_line(color = "white"),
    legend.background = element_rect(fill = "black", color = "black"),
    legend.title = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    plot.title = element_text(hjust = 0.5, color = "white", size = 16)
  )
## Warning: Removed 254 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database

This scatter plot shows the relationship between apartment price, area in square meters, and number of rooms, restricted to listings of up to 500 square meters and 1,000,000 in price. Prices tend to rise as area increases. The points are coloured by room category (1, 2, 3, 4, and 5 or more rooms). Apartments with more rooms generally have higher prices, largely because they also tend to be larger.
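Since price rises with area, a per-square-meter price can make apartments of different sizes easier to compare. The sketch below applies this to the six listings printed by head() earlier in the report. `price_per_m2` and `by_rooms` are hypothetical names, and six rows are far too few to draw conclusions about the full dataset.

```r
# First six listings, copied from the head() output
listings <- data.frame(
  price  = c(284000, 355000, 755000, 245000, 350000, 255000),
  rooms  = c(3, 3, 4, 3, 4, 2),
  square = c(140, 135, 210, 86, 174, 93)
)

# Hypothetical derived variable: price per square meter
listings$price_per_m2 <- listings$price / listings$square

# Median price and median price per square meter by room count
by_rooms <- aggregate(cbind(price, price_per_m2) ~ rooms,
                      data = listings, FUN = median)
print(by_rooms)
```

The same two lines of computation would run unchanged on the full data frame, since it has the same `price`, `rooms`, and `square` columns.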

library(dplyr)

# Recategorize the 'floor' variable
baku_apartment_data_floor <- baku_apartment_data %>%
  mutate(floor_cat = case_when(
    floor <= 20 ~ as.character(floor),
    floor > 20 ~ "20+"
  ))

# Convert floor_cat to a factor for better plotting
baku_apartment_data_floor$floor_cat <- factor(baku_apartment_data_floor$floor_cat, levels = c(as.character(1:20), "20+"))







library(ggplot2)
library(scales)  # For formatting numbers

# Plotting the price vs. categorized floor with adjustments
ggplot(baku_apartment_data_floor, aes(x = floor_cat, y = price)) +
  geom_boxplot(aes(fill = floor_cat), outlier.color = "red", outlier.size = 1.5) +
  scale_y_continuous(breaks = seq(0, 1e6, by = 100000),  # Adjust y-axis by 100k units
                     limits = c(0, 1e6),  # Limit y-axis to 1 million
                     labels = comma) +
  scale_fill_viridis_d(option = "A") +  # Using a discrete color scale
  labs(title = "Price Distribution by Floor",
       x = "Floor Number",
       y = "Price") +
  theme_minimal(base_family = "sans", base_size = 14) +  # "sans" is available on every graphics device, unlike "Helvetica" on Windows
  theme(
    plot.background = element_rect(fill = "black", color = "black"),
    panel.background = element_rect(fill = "black"),
    text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.ticks = element_line(color = "white"),
    legend.background = element_rect(fill = "black", color = "black"),
    legend.title = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    plot.title = element_text(hjust = 0.5, color = "white", size = 16)
  )
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

This boxplot shows the price distribution for each floor category. Apartments on higher floors (e.g., floors 12 and above) generally have higher median prices than those on lower floors. The whiskers show the spread of prices, and the outliers indicate that a few high-floor apartments are priced well above the median. Because the y-axis is limited to 1 million, the 245 listings priced outside that range are excluded from the plot.
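The `limits = c(0, 1e6)` argument in the plotting code above removes out-of-range rows before the boxplot statistics are computed, which is what the "Removed 245 rows" warning reports. The following is a minimal sketch, using hypothetical toy data rather than the Baku dataset, that compares this behavior with `coord_cartesian()`, which only zooms the view and keeps every row in the statistics.

```r
library(ggplot2)

# Hypothetical toy data: one group with two very expensive listings
toy <- data.frame(
  floor_cat = "1",
  price = c(100, 200, 300, 400, 500, 2000, 3000)
)

# Scale limits discard out-of-range rows before the boxplot statistics are computed
p_limits <- ggplot(toy, aes(x = floor_cat, y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 1000))

# coord_cartesian() only zooms the view; every row still enters the statistics
p_zoom <- ggplot(toy, aes(x = floor_cat, y = price)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 1000))

# layer_data() returns the computed boxplot statistics; 'middle' is the median
median_limits <- suppressWarnings(layer_data(p_limits)$middle)
median_zoom   <- layer_data(p_zoom)$middle

print(c(median_limits = median_limits, median_zoom = median_zoom))
```

If the medians in the report are meant to describe all listings, the zoom variant may be the safer choice. Whether that matters for the 245 affected rows is a judgment call for the analysis.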

library(ggplot2)
library(scales)  # For formatting numbers

# Plotting the price vs. regions
ggplot(baku_apartment_data, aes(x = region, y = price)) +
  geom_violin(aes(fill = region), trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.color = "red", outlier.size = 1.5, color = "white") +
  scale_y_continuous(breaks = seq(0, 1e6, by = 100000),  # Adjust y-axis by 100k units
                     limits = c(0, 1e6),  # Limit y-axis to 1 million
                     labels = comma) +
  scale_fill_viridis_d(option = "A") +  # Using a discrete color scale
  labs(title = "Price Distribution by Region",
       x = "Region",
       y = "Price") +
  theme_minimal(base_family = "sans", base_size = 14) +  # "sans" is available on every graphics device, unlike "Helvetica" on Windows
  theme(
    plot.background = element_rect(fill = "black", color = "black"),
    panel.background = element_rect(fill = "black"),
    text = element_text(color = "white"),
    axis.title = element_text(color = "white"),
    axis.text = element_text(color = "white"),
    axis.ticks = element_line(color = "white"),
    legend.background = element_rect(fill = "black", color = "black"),
    legend.title = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    plot.title = element_text(hjust = 0.5, color = "white", size = 16)
  )
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_violin()`).

The violin plot shows the price distribution of apartments in three regions: Close to Metro, Residential District, and Suburban Area. Apartments close to metro stations tend to have higher prices, as indicated by their higher median and wider distribution. Residential districts show a narrower range of prices, while suburban areas have the lowest prices and the most compressed distribution.

5.FEATURE ENGINEERING

colnames(baku_apartment_data)
##  [1] "price"            "location"         "rooms"            "square"          
##  [5] "new_building"     "has_repair"       "has_bill_of_sale" "has_mortgage"    
##  [9] "floor"            "total_floors"     "region"
str(baku_apartment_data)
## 'data.frame':    39302 obs. of  11 variables:
##  $ price           : int  284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
##  $ location        : Factor w/ 111 levels "1-ci mikrorayon q.",..: 18 86 91 38 38 73 16 48 80 43 ...
##  $ rooms           : int  3 3 4 3 4 2 3 3 2 3 ...
##  $ square          : num  140 135 210 86 174 93 133 130 63 123 ...
##  $ new_building    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_repair      : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_bill_of_sale: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_mortgage    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ floor           : int  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors    : int  12 20 18 10 15 16 6 18 16 17 ...
##  $ region          : Factor w/ 3 levels "Close to Metro",..: 1 1 2 1 1 1 3 1 1 1 ...
# Check for zeros or negative values before applying the log transformation
sum(baku_apartment_data$price <= 0)
## [1] 0
sum(baku_apartment_data$square <= 0)
## [1] 0
# Both counts are zero, so a plain log() would also be safe. The log(x + 1) form
# is kept because it is what the values reported below were computed with
# (for example, log(140 + 1) = 4.948760 for the first listing).
baku_apartment_data$log_price <- log(baku_apartment_data$price + 1)
baku_apartment_data$log_square <- log(baku_apartment_data$square + 1)
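
The values stored above are `log(x + 1)` rather than `log(x)`. As a small illustrative check, the sketch below measures how much the `+ 1` shift moves the result, using the price and square values of the first listing shown in the `str()` output. The variable names are chosen here for illustration only.

```r
# Values of the first listing shown in the str() output above
price  <- 284000
square <- 140

# Difference between the shifted and unshifted log for each variable
shift_price  <- log(price + 1) - log(price)
shift_square <- log(square + 1) - log(square)

print(c(shift_price = shift_price, shift_square = shift_square))

# log1p(x) is the built-in equivalent of log(x + 1)
same_as_log1p <- isTRUE(all.equal(log1p(square), log(square + 1)))
print(same_as_log1p)
```

The shift is negligible for prices in the hundreds of thousands but more noticeable for areas around 100, which is worth keeping in mind when reading `log_square` coefficients as elasticities.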

We check the summaries of the logged variables and then store them in a new data frame that drops the original price and square columns.

summary(baku_apartment_data$log_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.17   11.81   12.14   12.19   12.53   15.90
summary(baku_apartment_data$log_square)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.565   4.190   4.554   4.559   4.875   7.378
# Remove the original 'price' and 'square' columns
baku_apartment_data_logged <- baku_apartment_data %>%
  select(-price, -square)

Adding 5 Interaction and 2 Polynomial Features

library(dplyr)

# Assuming baku_apartment_data_logged is your current dataset
baku_apartment_data_logged <- baku_apartment_data_logged %>%
  mutate(
    # Interaction features
    interaction_log_square_floor = log_square * floor,
    # Note: as.numeric() on these factors returns the level codes (1 for "0", 2 for "1"), not 0/1
    interaction_log_square_new_building = log_square * as.numeric(new_building),
    interaction_floor_has_mortgage = floor * as.numeric(has_mortgage),
    interaction_log_square_rooms = log_square * rooms,
    interaction_floor_rooms = floor * rooms,

    # Polynomial features
    log_square_squared = log_square^2,
    floor_cubed = floor^3
  )

# Check the first few rows to confirm additions
head(baku_apartment_data_logged)
##                location rooms new_building has_repair has_bill_of_sale
## 1  Azadlıq Prospekti m.     3            1          1                1
## 2  Şah İsmayıl Xətai m.     3            1          1                1
## 3             Səbail r.     4            1          1                1
## 4 Elmlər Akademiyası m.     3            1          1                1
## 5 Elmlər Akademiyası m.     4            1          1                1
## 6             Nizami m.     2            1          1                1
##   has_mortgage floor total_floors               region log_price log_square
## 1            1     5           12       Close to Metro  12.55673   4.948760
## 2            1    19           20       Close to Metro  12.77988   4.912655
## 3            1     7           18 Residential District  13.53447   5.351858
## 4            1     8           10       Close to Metro  12.40902   4.465908
## 5            1    12           15       Close to Metro  12.76569   5.164786
## 6            1    10           16       Close to Metro  12.44902   4.543295
##   interaction_log_square_floor interaction_log_square_new_building
## 1                     24.74380                            9.897520
## 2                     93.34044                            9.825310
## 3                     37.46301                           10.703716
## 4                     35.72726                            8.931816
## 5                     61.97743                           10.329572
## 6                     45.43295                            9.086590
##   interaction_floor_has_mortgage interaction_log_square_rooms
## 1                             10                     14.84628
## 2                             38                     14.73796
## 3                             14                     21.40743
## 4                             16                     13.39772
## 5                             24                     20.65914
## 6                             20                      9.08659
##   interaction_floor_rooms log_square_squared floor_cubed
## 1                      15           24.49022         125
## 2                      57           24.13418        6859
## 3                      28           28.64239         343
## 4                      24           19.94434         512
## 5                      48           26.67501        1728
## 6                      20           20.64153        1000
ncol(baku_apartment_data_logged)
## [1] 18
nrow(baku_apartment_data_logged)
## [1] 39302
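
In the output above, the first listing has `interaction_log_square_new_building` equal to 9.897520, which is 2 × 4.948760. This is because `as.numeric()` on a factor returns the internal level codes, not the labels. The sketch below uses a hypothetical four-row example to show the difference from converting through `as.character()` first.

```r
# Hypothetical 0/1 flag stored as a factor, as in the dataset
flag <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))

codes  <- as.numeric(flag)                # level codes: 1 2 2 1
values <- as.numeric(as.character(flag))  # original labels as numbers: 0 1 1 0

log_square <- c(4.5, 4.9, 5.2, 4.1)  # hypothetical values for illustration

interaction_codes  <- log_square * codes   # every row is scaled, including flag == "0"
interaction_values <- log_square * values  # rows with flag == "0" become 0

print(data.frame(flag, codes, values, interaction_codes, interaction_values))
```

With the 1/2 coding, the interaction column is `log_square` plus `log_square` times the 0/1 dummy, so its coefficient has to be interpreted together with the main effect of `log_square`.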
library(dplyr)

# Convert factor levels in the dataset
baku_apartment_data_logged <- baku_apartment_data_logged %>%
  mutate(
    new_building = recode(new_building, `1` = "Yes", `0` = "No"),
    has_repair = recode(has_repair, `1` = "Yes", `0` = "No"),
    has_bill_of_sale = recode(has_bill_of_sale, `1` = "Yes", `0` = "No"),
    has_mortgage = recode(has_mortgage, `1` = "Yes", `0` = "No")
  )

# Check the updated factor levels
head(baku_apartment_data_logged)
##                location rooms new_building has_repair has_bill_of_sale
## 1  Azadlıq Prospekti m.     3          Yes        Yes              Yes
## 2  Şah İsmayıl Xətai m.     3          Yes        Yes              Yes
## 3             Səbail r.     4          Yes        Yes              Yes
## 4 Elmlər Akademiyası m.     3          Yes        Yes              Yes
## 5 Elmlər Akademiyası m.     4          Yes        Yes              Yes
## 6             Nizami m.     2          Yes        Yes              Yes
##   has_mortgage floor total_floors               region log_price log_square
## 1          Yes     5           12       Close to Metro  12.55673   4.948760
## 2          Yes    19           20       Close to Metro  12.77988   4.912655
## 3          Yes     7           18 Residential District  13.53447   5.351858
## 4          Yes     8           10       Close to Metro  12.40902   4.465908
## 5          Yes    12           15       Close to Metro  12.76569   5.164786
## 6          Yes    10           16       Close to Metro  12.44902   4.543295
##   interaction_log_square_floor interaction_log_square_new_building
## 1                     24.74380                            9.897520
## 2                     93.34044                            9.825310
## 3                     37.46301                           10.703716
## 4                     35.72726                            8.931816
## 5                     61.97743                           10.329572
## 6                     45.43295                            9.086590
##   interaction_floor_has_mortgage interaction_log_square_rooms
## 1                             10                     14.84628
## 2                             38                     14.73796
## 3                             14                     21.40743
## 4                             16                     13.39772
## 5                             24                     20.65914
## 6                             20                      9.08659
##   interaction_floor_rooms log_square_squared floor_cubed
## 1                      15           24.49022         125
## 2                      57           24.13418        6859
## 3                      28           28.64239         343
## 4                      24           19.94434         512
## 5                      48           26.67501        1728
## 6                      20           20.64153        1000
top_locations <- c("İnşaatçılar m.", "Nəriman Nərimanov m.", "Nəsimi r.",
                   "Şah İsmayıl Xətai m.", "Həzi Aslanov m.", "Memar Əcəmi m.",
                   "28 May m.", "Elmlər Akademiyası m.", "Nərimanov r.", "8 Noyabr m.")
library(dplyr)

baku_apartment_data_logged <- baku_apartment_data_logged %>%
  mutate(location = if_else(location %in% top_locations, as.character(location), "Others"))

# Convert 'location' back to a factor if needed
baku_apartment_data_logged$location <- factor(baku_apartment_data_logged$location)
table(baku_apartment_data_logged$location)
## 
##             28 May m.           8 Noyabr m. Elmlər Akademiyası m. 
##                  1696                  1408                  1675 
##       Həzi Aslanov m.        İnşaatçılar m.        Memar Əcəmi m. 
##                  2113                  2834                  1981 
##  Nəriman Nərimanov m.          Nərimanov r.             Nəsimi r. 
##                  2633                  1587                  2254 
##                Others  Şah İsmayıl Xətai m. 
##                 18950                  2171
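
The `top_locations` vector above is written out by hand. As a hedged alternative, the sketch below derives the most frequent levels from the data and lumps the rest into "Others". It uses a hypothetical toy vector, and it is an assumption that frequency is the intended selection criterion.

```r
# Hypothetical location vector for illustration
location <- factor(c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"))

n_keep <- 2  # number of levels to keep (10 in the report)

# Count each level and keep the names of the n_keep most frequent ones
counts <- sort(table(location), decreasing = TRUE)
top_levels <- names(counts)[seq_len(n_keep)]

# Replace every other level with "Others" and rebuild the factor
lumped <- factor(ifelse(location %in% top_levels, as.character(location), "Others"))

print(top_levels)
print(table(lumped))
```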
head(baku_apartment_data_logged)
##                location rooms new_building has_repair has_bill_of_sale
## 1                Others     3          Yes        Yes              Yes
## 2  Şah İsmayıl Xətai m.     3          Yes        Yes              Yes
## 3                Others     4          Yes        Yes              Yes
## 4 Elmlər Akademiyası m.     3          Yes        Yes              Yes
## 5 Elmlər Akademiyası m.     4          Yes        Yes              Yes
## 6                Others     2          Yes        Yes              Yes
##   has_mortgage floor total_floors               region log_price log_square
## 1          Yes     5           12       Close to Metro  12.55673   4.948760
## 2          Yes    19           20       Close to Metro  12.77988   4.912655
## 3          Yes     7           18 Residential District  13.53447   5.351858
## 4          Yes     8           10       Close to Metro  12.40902   4.465908
## 5          Yes    12           15       Close to Metro  12.76569   5.164786
## 6          Yes    10           16       Close to Metro  12.44902   4.543295
##   interaction_log_square_floor interaction_log_square_new_building
## 1                     24.74380                            9.897520
## 2                     93.34044                            9.825310
## 3                     37.46301                           10.703716
## 4                     35.72726                            8.931816
## 5                     61.97743                           10.329572
## 6                     45.43295                            9.086590
##   interaction_floor_has_mortgage interaction_log_square_rooms
## 1                             10                     14.84628
## 2                             38                     14.73796
## 3                             14                     21.40743
## 4                             16                     13.39772
## 5                             24                     20.65914
## 6                             20                      9.08659
##   interaction_floor_rooms log_square_squared floor_cubed
## 1                      15           24.49022         125
## 2                      57           24.13418        6859
## 3                      28           28.64239         343
## 4                      24           19.94434         512
## 5                      48           26.67501        1728
## 6                      20           20.64153        1000
library(dplyr)

baku_apartment_data_logged <- baku_apartment_data_logged %>%
  mutate(
    log_price = as.numeric(log_price),
    log_square = as.numeric(log_square),
    interaction_log_square_floor = as.numeric(interaction_log_square_floor),
    interaction_log_square_new_building = as.numeric(interaction_log_square_new_building),
    interaction_floor_has_mortgage = as.numeric(interaction_floor_has_mortgage),
    interaction_log_square_rooms = as.numeric(interaction_log_square_rooms),
    interaction_floor_rooms = as.numeric(interaction_floor_rooms),
    log_square_squared = as.numeric(log_square_squared),
    floor_cubed = as.numeric(floor_cubed)
  )

str(baku_apartment_data_logged)
## 'data.frame':    39302 obs. of  18 variables:
##  $ location                           : Factor w/ 11 levels "28 May m.","8 Noyabr m.",..: 10 11 10 3 3 10 10 5 10 4 ...
##  $ rooms                              : int  3 3 4 3 4 2 3 3 2 3 ...
##  $ new_building                       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_repair                         : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_bill_of_sale                   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ has_mortgage                       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ floor                              : int  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors                       : int  12 20 18 10 15 16 6 18 16 17 ...
##  $ region                             : Factor w/ 3 levels "Close to Metro",..: 1 1 2 1 1 1 3 1 1 1 ...
##  $ log_price                          : num  12.6 12.8 13.5 12.4 12.8 ...
##  $ log_square                         : num  4.95 4.91 5.35 4.47 5.16 ...
##  $ interaction_log_square_floor       : num  24.7 93.3 37.5 35.7 62 ...
##  $ interaction_log_square_new_building: num  9.9 9.83 10.7 8.93 10.33 ...
##  $ interaction_floor_has_mortgage     : num  10 38 14 16 24 20 8 18 30 20 ...
##  $ interaction_log_square_rooms       : num  14.8 14.7 21.4 13.4 20.7 ...
##  $ interaction_floor_rooms            : num  15 57 28 24 48 20 12 27 30 30 ...
##  $ log_square_squared                 : num  24.5 24.1 28.6 19.9 26.7 ...
##  $ floor_cubed                        : num  125 6859 343 512 1728 ...
# One-hot encode the factors with model.matrix(). Removing the intercept (- 1)
# keeps all 11 location dummies; the other factors still drop their first level.
data_encoded <- model.matrix(~ . - 1, data = baku_apartment_data_logged) %>% as.data.frame()

# log_price (the target) is already a column of the matrix; this reassignment leaves it unchanged
data_encoded$log_price <- baku_apartment_data_logged$log_price

# Verify the structure of the new encoded dataset
str(data_encoded)
## 'data.frame':    39302 obs. of  29 variables:
##  $ location28 May m.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ location8 Noyabr m.                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationElmlər Akademiyası m.      : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ locationHəzi Aslanov m.            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ locationİnşaatçılar m.             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ locationMemar Əcəmi m.             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəriman Nərimanov m.       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNərimanov r.               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəsimi r.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationOthers                     : num  1 0 1 0 0 1 1 0 1 0 ...
##  $ locationŞah İsmayıl Xətai m.       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ rooms                              : num  3 3 4 3 4 2 3 3 2 3 ...
##  $ new_buildingYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_repairYes                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_bill_of_saleYes                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_mortgageYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ floor                              : num  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors                       : num  12 20 18 10 15 16 6 18 16 17 ...
##  $ regionResidential District         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ regionSuburban Area                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ log_price                          : num  12.6 12.8 13.5 12.4 12.8 ...
##  $ log_square                         : num  4.95 4.91 5.35 4.47 5.16 ...
##  $ interaction_log_square_floor       : num  24.7 93.3 37.5 35.7 62 ...
##  $ interaction_log_square_new_building: num  9.9 9.83 10.7 8.93 10.33 ...
##  $ interaction_floor_has_mortgage     : num  10 38 14 16 24 20 8 18 30 20 ...
##  $ interaction_log_square_rooms       : num  14.8 14.7 21.4 13.4 20.7 ...
##  $ interaction_floor_rooms            : num  15 57 28 24 48 20 12 27 30 30 ...
##  $ log_square_squared                 : num  24.5 24.1 28.6 19.9 26.7 ...
##  $ floor_cubed                        : num  125 6859 343 512 1728 ...
ncol(data_encoded)
## [1] 29
nrow(data_encoded)
## [1] 39302
head(data_encoded)
##   location28 May m. location8 Noyabr m. locationElmlər Akademiyası m.
## 1                 0                   0                             0
## 2                 0                   0                             0
## 3                 0                   0                             0
## 4                 0                   0                             1
## 5                 0                   0                             1
## 6                 0                   0                             0
##   locationHəzi Aslanov m. locationİnşaatçılar m. locationMemar Əcəmi m.
## 1                       0                      0                      0
## 2                       0                      0                      0
## 3                       0                      0                      0
## 4                       0                      0                      0
## 5                       0                      0                      0
## 6                       0                      0                      0
##   locationNəriman Nərimanov m. locationNərimanov r. locationNəsimi r.
## 1                            0                    0                 0
## 2                            0                    0                 0
## 3                            0                    0                 0
## 4                            0                    0                 0
## 5                            0                    0                 0
## 6                            0                    0                 0
##   locationOthers locationŞah İsmayıl Xətai m. rooms new_buildingYes
## 1              1                            0     3               1
## 2              0                            1     3               1
## 3              1                            0     4               1
## 4              0                            0     3               1
## 5              0                            0     4               1
## 6              1                            0     2               1
##   has_repairYes has_bill_of_saleYes has_mortgageYes floor total_floors
## 1             1                   1               1     5           12
## 2             1                   1               1    19           20
## 3             1                   1               1     7           18
## 4             1                   1               1     8           10
## 5             1                   1               1    12           15
## 6             1                   1               1    10           16
##   regionResidential District regionSuburban Area log_price log_square
## 1                          0                   0  12.55673   4.948760
## 2                          0                   0  12.77988   4.912655
## 3                          1                   0  13.53447   5.351858
## 4                          0                   0  12.40902   4.465908
## 5                          0                   0  12.76569   5.164786
## 6                          0                   0  12.44902   4.543295
##   interaction_log_square_floor interaction_log_square_new_building
## 1                     24.74380                            9.897520
## 2                     93.34044                            9.825310
## 3                     37.46301                           10.703716
## 4                     35.72726                            8.931816
## 5                     61.97743                           10.329572
## 6                     45.43295                            9.086590
##   interaction_floor_has_mortgage interaction_log_square_rooms
## 1                             10                     14.84628
## 2                             38                     14.73796
## 3                             14                     21.40743
## 4                             16                     13.39772
## 5                             24                     20.65914
## 6                             20                      9.08659
##   interaction_floor_rooms log_square_squared floor_cubed
## 1                      15           24.49022         125
## 2                      57           24.13418        6859
## 3                      28           28.64239         343
## 4                      24           19.94434         512
## 5                      48           26.67501        1728
## 6                      20           20.64153        1000
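
The encoded data above contains a dummy column for every `location` level but only two of the three `region` levels. This follows from how `model.matrix()` treats the intercept. A minimal sketch on a hypothetical three-row data frame:

```r
# Hypothetical toy data with two factors and one numeric column
toy <- data.frame(
  region = factor(c("Close to Metro", "Residential District", "Suburban Area")),
  has_mortgage = factor(c("No", "Yes", "Yes")),
  y = c(1, 2, 3)
)

# Without an intercept, the first factor keeps all of its levels
no_intercept <- colnames(model.matrix(~ . - 1, data = toy))

# With an intercept, every factor drops its first (reference) level
with_intercept <- colnames(model.matrix(~ ., data = toy))

print(no_intercept)
print(with_intercept)
```

If all location dummies from the no-intercept encoding are later used in a model that has an intercept, one of them is redundant, so a reference level would need to be dropped at that point.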
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
# Fit a linear model on the engineered features
model <- lm(log_price ~ log_square + floor + floor_cubed + log_square_squared +
              interaction_log_square_floor + interaction_log_square_new_building +
              interaction_floor_has_mortgage + interaction_log_square_rooms +
              interaction_floor_rooms,
            data = baku_apartment_data_logged)
# Calculate VIF
vif_values <- vif(model)
print(vif_values)
##                          log_square                               floor 
##                          218.801164                          216.789746 
##                         floor_cubed                  log_square_squared 
##                            5.153233                          270.258685 
##        interaction_log_square_floor interaction_log_square_new_building 
##                          393.153458                            2.932818 
##      interaction_floor_has_mortgage        interaction_log_square_rooms 
##                            3.300673                           17.281943 
##             interaction_floor_rooms 
##                           43.385990
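
The very large VIF values above arise because a variable and its own square or product are almost linearly related. The sketch below is an illustrative base-R computation on simulated data, not the `car::vif()` implementation. It uses the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The helper name `vif_manual` and the simulated columns are hypothetical.

```r
set.seed(1)

# Simulated predictor on roughly the range of log_square, plus its square and an unrelated variable
x <- runif(200, min = 2.5, max = 7.5)
d <- data.frame(x = x, x_sq = x^2, z = rnorm(200))

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on all other columns
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_full    <- vif_manual(d)                 # x and x_sq inflate each other
vif_reduced <- vif_manual(d[, c("x", "z")])  # after dropping x_sq

print(round(vif_full, 2))
print(round(vif_reduced, 2))
```

Dropping the squared term brings the VIF of `x` back to about 1, which mirrors the approach of removing the most collinear engineered features from the model.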
library(dplyr)

# Updating the data_encoded dataset
data_encoded <- data_encoded %>%
  select(
    -log_square_squared,            # Remove log_square_squared
    -interaction_log_square_floor   # Remove interaction_log_square_floor
  )

# Verify the changes and structure of the updated dataset
str(data_encoded)
## 'data.frame':    39302 obs. of  27 variables:
##  $ location28 May m.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ location8 Noyabr m.                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationElmlər Akademiyası m.      : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ locationHəzi Aslanov m.            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ locationİnşaatçılar m.             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ locationMemar Əcəmi m.             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəriman Nərimanov m.       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNərimanov r.               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəsimi r.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationOthers                     : num  1 0 1 0 0 1 1 0 1 0 ...
##  $ locationŞah İsmayıl Xətai m.       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ rooms                              : num  3 3 4 3 4 2 3 3 2 3 ...
##  $ new_buildingYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_repairYes                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_bill_of_saleYes                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_mortgageYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ floor                              : num  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors                       : num  12 20 18 10 15 16 6 18 16 17 ...
##  $ regionResidential District         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ regionSuburban Area                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ log_price                          : num  12.6 12.8 13.5 12.4 12.8 ...
##  $ log_square                         : num  4.95 4.91 5.35 4.47 5.16 ...
##  $ interaction_log_square_new_building: num  9.9 9.83 10.7 8.93 10.33 ...
##  $ interaction_floor_has_mortgage     : num  10 38 14 16 24 20 8 18 30 20 ...
##  $ interaction_log_square_rooms       : num  14.8 14.7 21.4 13.4 20.7 ...
##  $ interaction_floor_rooms            : num  15 57 28 24 48 20 12 27 30 30 ...
##  $ floor_cubed                        : num  125 6859 343 512 1728 ...
# Fit a linear model with selected variables
model <- lm(log_price ~ rooms + floor + total_floors + log_square + 
            interaction_log_square_new_building + interaction_floor_has_mortgage +
            interaction_log_square_rooms + interaction_floor_rooms + floor_cubed,
            data = data_encoded)
# Load necessary library
library(car)

# Calculate and print VIF
vif_values <- vif(model)
print(vif_values)
##                               rooms                               floor 
##                           48.207924                           15.998556 
##                        total_floors                          log_square 
##                            3.041088                            8.532022 
## interaction_log_square_new_building      interaction_floor_has_mortgage 
##                            4.114478                            3.302992 
##        interaction_log_square_rooms             interaction_floor_rooms 
##                           77.606066                           15.932090 
##                         floor_cubed 
##                            4.871472
# Refit without 'rooms', the two rooms interactions, and 'floor_cubed'
updated_model <- lm(log_price ~ floor + total_floors + log_square + 
                    interaction_log_square_new_building + interaction_floor_has_mortgage,
                    data = data_encoded)

# Calculate and print new VIF values
new_vif_values <- vif(updated_model)
print(new_vif_values)
##                               floor                        total_floors 
##                            3.898824                            2.969334 
##                          log_square interaction_log_square_new_building 
##                            2.053790                            3.712131 
##      interaction_floor_has_mortgage 
##                            3.297039
library(dplyr)

# Update data_encoded by removing specified variables
data_encoded <- data_encoded %>%
  select(
    -floor_cubed, 
    -interaction_floor_rooms, 
    -interaction_log_square_rooms, 
    -rooms
  )

# Verify the structure of the updated dataset
str(data_encoded)
## 'data.frame':    39302 obs. of  23 variables:
##  $ location28 May m.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ location8 Noyabr m.                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationElmlər Akademiyası m.      : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ locationHəzi Aslanov m.            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ locationİnşaatçılar m.             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ locationMemar Əcəmi m.             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəriman Nərimanov m.       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNərimanov r.               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəsimi r.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationOthers                     : num  1 0 1 0 0 1 1 0 1 0 ...
##  $ locationŞah İsmayıl Xətai m.       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ new_buildingYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_repairYes                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_bill_of_saleYes                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_mortgageYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ floor                              : num  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors                       : num  12 20 18 10 15 16 6 18 16 17 ...
##  $ regionResidential District         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ regionSuburban Area                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ log_price                          : num  12.6 12.8 13.5 12.4 12.8 ...
##  $ log_square                         : num  4.95 4.91 5.35 4.47 5.16 ...
##  $ interaction_log_square_new_building: num  9.9 9.83 10.7 8.93 10.33 ...
##  $ interaction_floor_has_mortgage     : num  10 38 14 16 24 20 8 18 30 20 ...
nrow(data_encoded)
## [1] 39302
ncol(data_encoded)
## [1] 23
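One detail in the str() output above matters for interpreting the model later: in every displayed row where new_buildingYes is 1, interaction_log_square_new_building equals twice log_square (for example 9.9 against 4.95), and interaction_floor_has_mortgage equals twice floor. That pattern is what one would expect if the interactions were built from as.numeric() applied to a factor, which returns the level index (1 or 2) rather than a 0/1 indicator. The construction step is not shown in this section, so this is an inference rather than a confirmed fact. A minimal sketch with hypothetical toy values shows the difference between the two codings:

```r
# Toy data: a two-level factor and a continuous variable (hypothetical values)
toy <- data.frame(
  new_building = factor(c("No", "Yes", "Yes", "No")),
  log_square   = c(4.0, 4.5, 5.0, 4.2)
)

# as.numeric() on a factor returns the level index (No = 1, Yes = 2)
coded_12 <- as.numeric(toy$new_building)

# A 0/1 indicator, matching how the dummy columns themselves are coded
coded_01 <- as.integer(toy$new_building == "Yes")

toy$interaction_12 <- toy$log_square * coded_12  # 2 * log_square when "Yes"
toy$interaction_01 <- toy$log_square * coded_01  # 1 * log_square when "Yes"
print(toy)
```

Under the 1/2 coding the interaction term also loads on old buildings, so the coefficient on log_square alone is no longer the size elasticity for old buildings.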
# Drop the 'locationOthers' dummy so that "Others" becomes the reference
# category for the location effects (avoids the dummy-variable trap)
data_encoded <- data_encoded %>%
  select(-locationOthers)

# Verify the structure of the updated dataset
str(data_encoded)
## 'data.frame':    39302 obs. of  22 variables:
##  $ location28 May m.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ location8 Noyabr m.                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationElmlər Akademiyası m.      : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ locationHəzi Aslanov m.            : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ locationİnşaatçılar m.             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ locationMemar Əcəmi m.             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəriman Nərimanov m.       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNərimanov r.               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationNəsimi r.                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ locationŞah İsmayıl Xətai m.       : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ new_buildingYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_repairYes                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_bill_of_saleYes                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ has_mortgageYes                    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ floor                              : num  5 19 7 8 12 10 4 9 15 10 ...
##  $ total_floors                       : num  12 20 18 10 15 16 6 18 16 17 ...
##  $ regionResidential District         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ regionSuburban Area                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ log_price                          : num  12.6 12.8 13.5 12.4 12.8 ...
##  $ log_square                         : num  4.95 4.91 5.35 4.47 5.16 ...
##  $ interaction_log_square_new_building: num  9.9 9.83 10.7 8.93 10.33 ...
##  $ interaction_floor_has_mortgage     : num  10 38 14 16 24 20 8 18 30 20 ...
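
Dropping locationOthers makes "Others" the baseline against which the other location coefficients are measured. The reason one dummy has to go can be sketched with toy data (the labels and prices below are hypothetical): a full set of dummies sums to one in every row, which duplicates the intercept.

```r
loc <- factor(c("A", "B", "Others", "A", "Others", "B"))  # hypothetical labels
y   <- c(10, 12, 9, 11, 8, 13)                            # hypothetical prices

D <- model.matrix(~ loc - 1)      # one 0/1 column per level
colnames(D) <- levels(loc)
rowSums(D)                        # always 1, identical to the intercept column

d <- data.frame(y, D)
fit_all  <- lm(y ~ ., data = d)            # all three dummies plus intercept
fit_base <- lm(y ~ . - Others, data = d)   # "Others" dropped as baseline
coef(fit_all)    # the last dummy is NA because it is perfectly collinear
coef(fit_base)   # all coefficients estimable, measured relative to "Others"
```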

6. Regression Modeling

Summary

# Fit the linear regression model using all predictors in data_encoded
model <- lm(log_price ~ ., data = data_encoded)

# Display the summary of the model to see the coefficients and statistics
summary(model)
## 
## Call:
## lm(formula = log_price ~ ., data = data_encoded)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72106 -0.13880 -0.00728  0.12868  2.49264 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          8.2803945  0.0302044 274.145  < 2e-16 ***
## `location28 May m.`                  0.1467243  0.0065236  22.491  < 2e-16 ***
## `location8 Noyabr m.`                0.0465046  0.0070985   6.551 5.77e-11 ***
## `locationElmlər Akademiyası m.`      0.1449670  0.0065312  22.196  < 2e-16 ***
## `locationHəzi Aslanov m.`           -0.1870247  0.0059136 -31.626  < 2e-16 ***
## `locationİnşaatçılar m.`            -0.1026770  0.0052660 -19.498  < 2e-16 ***
## `locationMemar Əcəmi m.`            -0.1056256  0.0061108 -17.285  < 2e-16 ***
## `locationNəriman Nərimanov m.`       0.1124459  0.0054052  20.803  < 2e-16 ***
## `locationNərimanov r.`               0.2062878  0.0078985  26.117  < 2e-16 ***
## `locationNəsimi r.`                  0.2267927  0.0071857  31.562  < 2e-16 ***
## `locationŞah İsmayıl Xətai m.`       0.1092030  0.0059098  18.478  < 2e-16 ***
## new_buildingYes                     -0.6201738  0.0342797 -18.092  < 2e-16 ***
## has_repairYes                        0.1320820  0.0037267  35.442  < 2e-16 ***
## has_bill_of_saleYes                  0.0046467  0.0035555   1.307  0.19125    
## has_mortgageYes                     -0.0429767  0.0053495  -8.034 9.71e-16 ***
## floor                               -0.0066657  0.0008028  -8.303  < 2e-16 ***
## total_floors                         0.0072447  0.0004085  17.734  < 2e-16 ***
## `regionResidential District`        -0.1147504  0.0054709 -20.975  < 2e-16 ***
## `regionSuburban Area`               -0.2177366  0.0040446 -53.834  < 2e-16 ***
## log_square                           0.6752652  0.0145670  46.356  < 2e-16 ***
## interaction_log_square_new_building  0.1436735  0.0078914  18.206  < 2e-16 ***
## interaction_floor_has_mortgage       0.0016222  0.0005329   3.044  0.00234 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2459 on 39280 degrees of freedom
## Multiple R-squared:  0.7954, Adjusted R-squared:  0.7953 
## F-statistic:  7271 on 21 and 39280 DF,  p-value: < 2.2e-16
# Fit the reduced model without 'has_bill_of_saleYes'
reduced_model <- lm(log_price ~ . - has_bill_of_saleYes, data = data_encoded)
# Perform ANOVA comparison
model_comparison <- anova(model, reduced_model)
print(model_comparison)
## Analysis of Variance Table
## 
## Model 1: log_price ~ `location28 May m.` + `location8 Noyabr m.` + `locationElmlər Akademiyası m.` + 
##     `locationHəzi Aslanov m.` + `locationİnşaatçılar m.` + 
##     `locationMemar Əcəmi m.` + `locationNəriman Nərimanov m.` + 
##     `locationNərimanov r.` + `locationNəsimi r.` + `locationŞah İsmayıl Xətai m.` + 
##     new_buildingYes + has_repairYes + has_bill_of_saleYes + has_mortgageYes + 
##     floor + total_floors + `regionResidential District` + `regionSuburban Area` + 
##     log_square + interaction_log_square_new_building + interaction_floor_has_mortgage
## Model 2: log_price ~ (`location28 May m.` + `location8 Noyabr m.` + `locationElmlər Akademiyası m.` + 
##     `locationHəzi Aslanov m.` + `locationİnşaatçılar m.` + 
##     `locationMemar Əcəmi m.` + `locationNəriman Nərimanov m.` + 
##     `locationNərimanov r.` + `locationNəsimi r.` + `locationŞah İsmayıl Xətai m.` + 
##     new_buildingYes + has_repairYes + has_bill_of_saleYes + has_mortgageYes + 
##     floor + total_floors + `regionResidential District` + `regionSuburban Area` + 
##     log_square + interaction_log_square_new_building + interaction_floor_has_mortgage) - 
##     has_bill_of_saleYes
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1  39280 2375.6                          
## 2  39281 2375.7 -1   -0.1033 1.708 0.1913

The p-value of 0.1913 from the ANOVA test is greater than the common significance level of 0.05, suggesting that the difference in the fit between the original model and the reduced model (without has_bill_of_saleYes) is not statistically significant. This means that removing has_bill_of_saleYes does not significantly worsen the model’s fit.
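
When a single coefficient is dropped, the ANOVA F-test is equivalent to the t-test on that coefficient in the full model: here 1.307² ≈ 1.708, and both p-values are 0.191. A small sketch on R's built-in mtcars data (used only as a stand-in for the apartment data) illustrates the identity:

```r
# Built-in mtcars data stands in for the apartment data (illustration only)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# t statistic of the dropped coefficient in the full model
t_qsec <- summary(full)$coefficients["qsec", "t value"]

# F statistic from the nested-model comparison
f_stat <- anova(reduced, full)$F[2]

c(t_squared = t_qsec^2, F = f_stat)
```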

THE FINAL MODEL:

# Rename the reduced model to final_model
final_model <- reduced_model

# Display the summary of the final model
summary(final_model)
## 
## Call:
## lm(formula = log_price ~ . - has_bill_of_saleYes, data = data_encoded)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72382 -0.13860 -0.00727  0.12856  2.49202 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          8.2842176  0.0300627 275.564  < 2e-16 ***
## `location28 May m.`                  0.1467025  0.0065236  22.488  < 2e-16 ***
## `location8 Noyabr m.`                0.0468381  0.0070940   6.602 4.09e-11 ***
## `locationElmlər Akademiyası m.`      0.1452867  0.0065267  22.260  < 2e-16 ***
## `locationHəzi Aslanov m.`           -0.1869151  0.0059131 -31.610  < 2e-16 ***
## `locationİnşaatçılar m.`            -0.1024154  0.0052622 -19.462  < 2e-16 ***
## `locationMemar Əcəmi m.`            -0.1056365  0.0061109 -17.287  < 2e-16 ***
## `locationNəriman Nərimanov m.`       0.1121590  0.0054008  20.767  < 2e-16 ***
## `locationNərimanov r.`               0.2062880  0.0078985  26.117  < 2e-16 ***
## `locationNəsimi r.`                  0.2267076  0.0071855  31.551  < 2e-16 ***
## `locationŞah İsmayıl Xətai m.`       0.1091699  0.0059098  18.473  < 2e-16 ***
## new_buildingYes                     -0.6265199  0.0339343 -18.463  < 2e-16 ***
## has_repairYes                        0.1334520  0.0035763  37.316  < 2e-16 ***
## has_mortgageYes                     -0.0418576  0.0052805  -7.927 2.31e-15 ***
## floor                               -0.0066987  0.0008024  -8.348  < 2e-16 ***
## total_floors                         0.0072524  0.0004085  17.754  < 2e-16 ***
## `regionResidential District`        -0.1147580  0.0054709 -20.976  < 2e-16 ***
## `regionSuburban Area`               -0.2177396  0.0040446 -53.834  < 2e-16 ***
## log_square                           0.6739214  0.0145308  46.379  < 2e-16 ***
## interaction_log_square_new_building  0.1447786  0.0078461  18.452  < 2e-16 ***
## interaction_floor_has_mortgage       0.0016530  0.0005324   3.105  0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2459 on 39281 degrees of freedom
## Multiple R-squared:  0.7954, Adjusted R-squared:  0.7953 
## F-statistic:  7634 on 20 and 39281 DF,  p-value: < 2.2e-16

7.Assumptions

Gauss-Markov: 6 Golden Assumptions (BLUE)

Because the dataset is cross-sectional rather than time-series, the observations are not ordered in time, so the assumption of no autocorrelation of residuals (independence) is not tested explicitly. The remaining five assumptions are listed below:

  1. Linearity: Ensure the relationship between the predictors and the response is linear. This can be visually inspected through plots of the residuals vs. fitted values.

  2. Normality of Errors: This can be assessed with a Q-Q plot of the residuals. If the residuals deviate significantly from the line in a Q-Q plot, it may indicate non-normality, which can affect the validity of confidence intervals and hypothesis tests.

  3. Multicollinearity: Checked with the VIF scores computed earlier. High VIFs indicate that linear dependencies among the explanatory variables are inflating the variances of the estimated coefficients.

  4. Homoscedasticity: Look for constant variance of residuals in the plot of residuals vs. fitted values. A funnel shape or any other systematic change in spread indicates heteroscedasticity and violates this assumption.

  5. Endogeneity: Difficult to test directly without valid external instruments, so careful model specification and an understanding of the data-generating process are crucial. All relevant variables should be included to avoid omitted variable bias.

1.Linearity

# Ensure the 'lmtest' package is loaded for the RESET test
if (!require(lmtest)) {
  install.packages("lmtest")
}
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lmtest)

# Run Ramsey's RESET test for model specification
reset_test <- resettest(final_model)
print(reset_test)
## 
##  RESET test
## 
## data:  final_model
## RESET = 57.067, df1 = 2, df2 = 39279, p-value < 2.2e-16

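The RESET test rejects the null hypothesis of a correctly specified functional form (RESET = 57.067, p-value < 2.2e-16). Powers of the fitted values add explanatory power, which suggests that the model is missing nonlinear terms or interactions. With 39,302 observations the test can detect even small departures from linearity, so the rejection signals misspecification but does not by itself measure how large it is.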
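
The mechanics of the test can be sketched in base R: refit the model with the squared and cubed fitted values added, then compare the two fits with an F-test (to my understanding this matches the default settings of resettest(), and the df1 = 2 above is consistent with two added terms). The data below are simulated with a deliberately quadratic relationship and are not the apartment data.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 5)
y <- 1 + 0.5 * x^2 + rnorm(n)   # true relationship is quadratic (simulated)

base_fit <- lm(y ~ x)           # misspecified linear fit
yhat     <- fitted(base_fit)

# RESET idea: add powers of the fitted values and test them jointly
aug_fit <- lm(y ~ x + I(yhat^2) + I(yhat^3))
reset_f <- anova(base_fit, aug_fit)
print(reset_f)
```

A small p-value in the second row indicates that the added powers improve the fit, which is the signal that the linear specification is missing curvature.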

2.Normality

# Load necessary package
if (!require(nortest)) {
  install.packages("nortest")
}
## Loading required package: nortest
library(nortest)

# Perform Lilliefors (Kolmogorov-Smirnov) test for normality
lillie_test <- lillie.test(residuals(final_model))
print(lillie_test)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  residuals(final_model)
## D = 0.055862, p-value < 2.2e-16

The Lilliefors (Kolmogorov-Smirnov) test returns a p-value far below 0.05 (p-value < 2.2e-16), so we reject the hypothesis that the residuals of our model are normally distributed. The test statistic D = 0.055862 is the maximum absolute difference between the empirical cumulative distribution of the residuals and the cumulative distribution expected under normality. With a sample this large, even a modest deviation of this size is enough to reject.
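
A Q-Q plot shows where the departure from normality occurs, which the test statistic alone does not. The sketch below uses simulated heavy-tailed values as a stand-in; for the report's model the input would be residuals(final_model).

```r
set.seed(6)
res <- rt(5000, df = 4)   # simulated heavy-tailed "residuals", not the report's data

qq <- qqnorm(res, main = "Normal Q-Q plot of simulated residuals")
qqline(res)               # reference line through the quartiles

# Numeric companions to the plot: close to 0 and 3 under normality
z <- (res - mean(res)) / sd(res)
c(skewness = mean(z^3), kurtosis = mean(z^4))
```

Points that bend away from the line at both ends indicate heavy tails, while a bend at one end only indicates skewness.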

3.Multicollinearity Check

# Select only numeric predictors
numeric_vars <- c("floor", "total_floors", "log_square", 
                  "interaction_log_square_new_building", 
                  "interaction_floor_has_mortgage")

# Fit a new linear model using only numeric predictors
numeric_model <- lm(log_price ~ ., data = data_encoded[, c(numeric_vars, "log_price")])

# Compute VIF for only numeric predictors
vif_numeric <- vif(numeric_model)
print(vif_numeric)
##                               floor                        total_floors 
##                            3.898824                            2.969334 
##                          log_square interaction_log_square_new_building 
##                            2.053790                            3.712131 
##      interaction_floor_has_mortgage 
##                            3.297039

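All VIF values are below 5, a common rule-of-thumb threshold, so there is no sign of problematic multicollinearity among these predictors. VIF is a descriptive diagnostic rather than a formal hypothesis test, so there is no H₀ to reject. Note also that these values come from a model containing only the five numeric predictors, not the dummy variables that are part of final_model.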
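
For reference, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A base-R sketch on simulated data, in which x1 and x2 are made deliberately collinear (vif_manual is a hypothetical helper, not a function from car):

```r
set.seed(2)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)
d  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors
vif_manual <- function(target, others, data) {
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

preds <- c("x1", "x2", "x3")
vifs  <- sapply(preds, function(p) vif_manual(p, setdiff(preds, p), d))
print(round(vifs, 2))
```

In this setup x1 and x2 should show VIFs well above 5 while x3 stays close to 1.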

4.Endogeneity Check

IV regression Application

if (!require(AER)) install.packages("AER")  # IV regression
## Loading required package: AER
## Loading required package: sandwich
## Loading required package: survival
library(AER)

if (!require(lmtest)) install.packages("lmtest")  # For hypothesis testing
library(lmtest)

# First stage: regress suspected endogenous variable on instruments
first_stage <- lm(log_square ~ total_floors, data = data_encoded)

# Save residuals from first-stage regression
data_encoded$residuals_iv <- residuals(first_stage)

# Second stage: Include residuals from first-stage regression
hausman_model <- lm(log_price ~ log_square + floor + total_floors + 
                    interaction_log_square_new_building + interaction_floor_has_mortgage + 
                    residuals_iv, data = data_encoded)

# Perform t-test on residual term to check for endogeneity
summary(hausman_model)
## 
## Call:
## lm(formula = log_price ~ log_square + floor + total_floors + 
##     interaction_log_square_new_building + interaction_floor_has_mortgage + 
##     residuals_iv, data = data_encoded)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79238 -0.15512 -0.00093  0.16533  2.52511 
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          7.6313932  0.0156546 487.487  < 2e-16 ***
## log_square                           0.9885958  0.0043166 229.023  < 2e-16 ***
## floor                               -0.0017307  0.0005576  -3.104  0.00191 ** 
## total_floors                         0.0115576  0.0004463  25.899  < 2e-16 ***
## interaction_log_square_new_building -0.0083612  0.0011527  -7.254 4.13e-13 ***
## interaction_floor_has_mortgage      -0.0023762  0.0003027  -7.849 4.30e-15 ***
## residuals_iv                                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2778 on 39296 degrees of freedom
## Multiple R-squared:  0.7388, Adjusted R-squared:  0.7388 
## F-statistic: 2.223e+04 on 5 and 39296 DF,  p-value: < 2.2e-16

The coefficient on residuals_iv is reported as NA because the term is perfectly collinear with the other regressors. This happens by construction: the first-stage residual equals log_square minus a linear function of total_floors, and log_square, total_floors, and the intercept are all already in the second-stage model. The NA therefore says nothing about whether log_square is exogenous.

Final Summary

  1. The check as implemented is uninformative. A valid Durbin-Wu-Hausman (control-function) test needs at least one instrument that is excluded from the price equation, and total_floors is not excluded because it is itself a regressor.

  2. Endogeneity of log_square can be neither confirmed nor ruled out with the variables at hand.

  3. Decision: We continue with OLS, but the coefficients should be read as conditional associations rather than causal effects, and omitted variable bias remains possible.
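
The NA on residuals_iv can be reproduced on simulated data in which the regressor is endogenous by design, which helps show what the NA does and does not reveal. The sketch below is an illustration under stated assumptions: z is a hypothetical excluded instrument and u a hypothetical unobserved confounder, and neither exists in the apartment data.

```r
set.seed(3)
n <- 2000
z <- rnorm(n)    # excluded instrument (hypothetical)
w <- rnorm(n)    # exogenous control that belongs in the outcome equation
u <- rnorm(n)    # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)          # x is endogenous through u
y <- 1 + 1.0 * x + 0.5 * w - 1.0 * u + rnorm(n)

# (a) First stage uses only w, which is already in the outcome equation:
#     the residual is an exact linear combination of x, w, and the intercept
v_bad   <- residuals(lm(x ~ w))
fit_bad <- lm(y ~ x + w + v_bad)
coef(fit_bad)["v_bad"]    # NA by construction, even though x is endogenous

# (b) First stage includes the excluded instrument z
v_good   <- residuals(lm(x ~ z + w))
fit_good <- lm(y ~ x + w + v_good)
summary(fit_good)$coefficients["v_good", ]   # the t-test on this row is the check
```

In case (b) a significant coefficient on the first-stage residual points to endogeneity. The conclusion is only as good as the instrument, which must affect the outcome solely through x.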

5.Heteroscedasticity Check

# Install and load necessary package
if (!require(lmtest)) install.packages("lmtest")
library(lmtest)

# Perform Breusch-Pagan test
bp_test <- bptest(final_model)

# Print test result
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  final_model
## BP = 3512.6, df = 20, p-value < 2.2e-16

Decision rule: if the p-value exceeds 0.05, we fail to reject H₀ (homoscedasticity); if it is below 0.05, we reject H₀ and conclude that heteroscedasticity is present.

Since the p-value is extremely small (< 2.2e-16), we reject the null hypothesis (H₀) of homoscedasticity: the variance of the residuals is not constant. The OLS coefficient estimates remain unbiased, but the conventional standard errors are no longer valid, so the p-values and confidence intervals from summary() may not be reliable.
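
The studentized Breusch-Pagan statistic can be computed by hand, which makes clear what is being tested: the squared residuals are regressed on the model's regressors, and n·R² of that auxiliary regression is compared with a chi-square distribution whose degrees of freedom equal the number of regressors (20 in the output above). A sketch on simulated data with one regressor:

```r
set.seed(4)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # error spread grows with x (simulated)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the regressors
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of regressors
c(BP = bp_stat, p_value = bp_p)
```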

Robust Standard Error Application

# Install and load necessary package
if (!require(sandwich)) install.packages("sandwich")
if (!require(lmtest)) install.packages("lmtest")
library(sandwich)
library(lmtest)

# Compute robust standard errors (HC1) and re-run hypothesis tests
robust_se <- coeftest(final_model, vcov = vcovHC(final_model, type = "HC1"))

# Print results with robust standard errors
print(robust_se)
## 
## t test of coefficients:
## 
##                                        Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)                          8.28421758  0.04104606 201.8273 < 2.2e-16
## `location28 May m.`                  0.14670248  0.00598256  24.5217 < 2.2e-16
## `location8 Noyabr m.`                0.04683806  0.00545118   8.5923 < 2.2e-16
## `locationElmlər Akademiyası m.`      0.14528668  0.00596229  24.3676 < 2.2e-16
## `locationHəzi Aslanov m.`           -0.18691513  0.00411735 -45.3969 < 2.2e-16
## `locationİnşaatçılar m.`            -0.10241537  0.00448243 -22.8482 < 2.2e-16
## `locationMemar Əcəmi m.`            -0.10563652  0.00409048 -25.8249 < 2.2e-16
## `locationNəriman Nərimanov m.`       0.11215903  0.00456917  24.5469 < 2.2e-16
## `locationNərimanov r.`               0.20628803  0.00766612  26.9091 < 2.2e-16
## `locationNəsimi r.`                  0.22670757  0.00762101  29.7477 < 2.2e-16
## `locationŞah İsmayıl Xətai m.`       0.10916992  0.00559034  19.5283 < 2.2e-16
## new_buildingYes                     -0.62651987  0.04676083 -13.3984 < 2.2e-16
## has_repairYes                        0.13345199  0.00422712  31.5704 < 2.2e-16
## has_mortgageYes                     -0.04185764  0.00544506  -7.6873 1.538e-14
## floor                               -0.00669868  0.00082175  -8.1517 3.694e-16
## total_floors                         0.00725242  0.00055128  13.1556 < 2.2e-16
## `regionResidential District`        -0.11475799  0.00637638 -17.9974 < 2.2e-16
## `regionSuburban Area`               -0.21773958  0.00478879 -45.4686 < 2.2e-16
## log_square                           0.67392139  0.02055820  32.7811 < 2.2e-16
## interaction_log_square_new_building  0.14477857  0.01094386  13.2292 < 2.2e-16
## interaction_floor_has_mortgage       0.00165295  0.00052434   3.1524   0.00162
##                                        
## (Intercept)                         ***
## `location28 May m.`                 ***
## `location8 Noyabr m.`               ***
## `locationElmlər Akademiyası m.`     ***
## `locationHəzi Aslanov m.`           ***
## `locationİnşaatçılar m.`            ***
## `locationMemar Əcəmi m.`            ***
## `locationNəriman Nərimanov m.`      ***
## `locationNərimanov r.`              ***
## `locationNəsimi r.`                 ***
## `locationŞah İsmayıl Xətai m.`      ***
## new_buildingYes                     ***
## has_repairYes                       ***
## has_mortgageYes                     ***
## floor                               ***
## total_floors                        ***
## `regionResidential District`        ***
## `regionSuburban Area`               ***
## log_square                          ***
## interaction_log_square_new_building ***
## interaction_floor_has_mortgage      ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

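Robust (heteroscedasticity-consistent) standard errors leave the model and its coefficient estimates unchanged. They replace only the variance estimate, so that t-tests and confidence intervals remain valid when the error variance is not constant. Under HC1 all coefficients stay significant at the 1% level, although several standard errors grow (for example, log_square from 0.0145 to 0.0206).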
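
The HC1 estimator is the usual sandwich formula (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹ scaled by n / (n − k). A base-R sketch on simulated heteroscedastic data shows the computation without relying on the sandwich package:

```r
set.seed(5)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # heteroscedastic errors (simulated)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

bread    <- solve(crossprod(X))   # (X'X)^-1
meat     <- crossprod(X * e)      # X' diag(e^2) X (each row of X scaled by its residual)
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_hc1     <- sqrt(diag(vcov_hc1))
cbind(estimate = coef(fit), se_classic, se_hc1)
```

The estimates column is identical in both cases; only the standard errors differ.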

In what follows, we use the original final_model for coefficients and predictions, and the HC1 robust standard errors for p-values and confidence intervals.

In summary, we checked five assumptions of linear regression, leaving out autocorrelation because the data are cross-sectional. The VIF values of the numeric predictors are all below 5, so multicollinearity is not a concern. The endogeneity check was inconclusive: the first-stage residual was perfectly collinear with the regressors, so exogeneity of log_square is assumed rather than tested. The Breusch-Pagan test detected heteroscedasticity, which we address by using HC1 robust standard errors for inference. The Lilliefors test rejects normality of the residuals, and the RESET test rejects the linear functional form. Non-normality is a minor concern here: with 39,302 observations, the Central Limit Theorem implies that the coefficient estimates are approximately normally distributed, so t-tests and confidence intervals based on robust standard errors remain reliable. The RESET rejection is more substantive, because the CLT does not correct a misspecified functional form; it likely reflects nonlinear effects and interactions among housing price determinants that the current specification does not capture. The final model should therefore be read as a useful linear approximation (R² of about 0.80) of the factors associated with apartment prices in Baku, rather than as an exact description.

8.Results and Discussion

This study aimed to examine the key factors influencing apartment prices in Baku, focusing on location, infrastructure, building characteristics, mortgage availability, and floor level. The dependent variable is the log of price. Only apartment size enters in logs, so the coefficient on log_square is an elasticity: the percentage change in price for a 1% change in size. All other variables enter in levels, so a coefficient b corresponds to an approximate 100·b % change in price per one-unit change (exactly 100·(exp(b) − 1) %); for a dummy variable this is the difference relative to its reference category.

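1. Effect of Proximity to Key Amenities (Metro Stations and Infrastructure)

Location is captured by dummy variables for the nearest metro station ("m.") or district (the suffix "r." appears to denote a district rather than a station), each measured relative to the omitted "Others" category. The coefficients therefore reflect area-level price differences rather than distance to a station. Apartments near 28 May m. (coefficient 0.147, about +15.8%) and Elmlər Akademiyası m. (0.145, about +15.6%), and in Nərimanov r. (0.206, about +22.9%), sell at a premium, which is consistent with the value of central, well-connected locations. Conversely, Həzi Aslanov m. (-0.187, about -17.0%) and İnşaatçılar m. (-0.102, about -9.7%) sell at a discount, so being near a metro station does not by itself guarantee a premium; these areas may be less desirable, potentially due to congestion or lower surrounding infrastructure quality.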
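
Because the dependent variable is in logs, a dummy coefficient b translates into an exact price difference of 100·(exp(b) − 1) percent, which differs noticeably from 100·b once |b| exceeds about 0.1. A short sketch using the location estimates reported in the coefficient table:

```r
# Estimates for selected location dummies, copied from the coefficient table
b <- setNames(
  c(0.14670248, 0.14528668, 0.20628803, -0.18691513, -0.10241537),
  c("28 May m.", "Elmlər Akademiyası m.", "Nərimanov r.",
    "Həzi Aslanov m.", "İnşaatçılar m.")
)

approx_pct <- 100 * b              # log-point approximation
exact_pct  <- 100 * (exp(b) - 1)   # exact percentage difference vs. "Others"

round(cbind(approx_pct, exact_pct), 1)
```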

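2. Floor Level and Apartment Prices in High-Rise Buildings

The coefficient on floor is -0.0067, which on its own corresponds to a price about 0.67% lower for each additional floor, holding all else constant. Because the model also contains the floor × mortgage interaction (+0.0017), the floor slope differs by mortgage status. For mortgage-eligible apartments, where the str() output shows the interaction equal to 2 × floor, the slope is -0.0067 + 2 × 0.0017 ≈ -0.0034, or about -0.34% per floor; it is negative in both groups. This contradicts the assumption that higher floors are always more desirable and suggests that buyers in Baku's market might prefer lower floors due to elevator accessibility, fire safety concerns, or cultural preferences. The total number of floors in the building has a positive effect (+0.0073, about +0.73% per additional storey), suggesting that taller buildings generally command higher prices, likely due to better amenities, security, or architectural quality.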

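3. Price Differences Between New and Older Apartments

The coefficient on new_buildingYes (-0.627) should not be read as a 62.7% discount. Because the model includes an interaction between new_building and log_square, this coefficient is the new-versus-old gap at log_square = 0, that is, for a hypothetical 1 m² apartment far outside the data. The gap itself rises with apartment size through the positive interaction coefficient (+0.145). If the interaction was built with the dummy coded 1/2, as the str() output suggests (it equals 2 × log_square for new buildings), the gap in log price is -0.627 + 0.145 × log_square. This is negative for small apartments, zero at around 76 m² (assuming area is measured in square metres), and positive above that, for example about +9% at log_square = 4.95 (roughly 141 m²). New construction is therefore associated with a discount for small apartments and a premium for large ones, rather than a uniform discount.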
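
The size-dependent gap between new and old buildings can be tabulated directly from the two coefficients. The sketch below rests on the assumption, inferred from the str() output and not confirmed by the source, that the interaction was built with the dummy coded 1 (No) and 2 (Yes); the example areas are hypothetical.

```r
b_new <- -0.6265199   # new_buildingYes
b_int <-  0.1447786   # interaction_log_square_new_building

# Assumed coding 1/2: new-versus-old gap in log price = b_new + b_int * log(area)
gap_log <- function(area) b_new + b_int * log(area)

areas <- c(50, 75, 100, 140)   # hypothetical apartment sizes
data.frame(area    = areas,
           gap_pct = round(100 * (exp(gap_log(areas)) - 1), 1))

break_even_area <- exp(-b_new / b_int)   # size at which the gap is zero
break_even_area
```

If the dummy had instead been coded 0/2, the slope of the gap would double and the break-even size would be far smaller, so the coding should be confirmed before quoting these figures.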

4. Impact of Mortgage Availability on Property Prices

The availability of a mortgage is negatively associated with price: the coefficient on has_mortgageYes is -0.042, while the floor × mortgage interaction is +0.0017. The mortgage gap is therefore about -4% on the lowest floors and narrows with height, to roughly -2.5% on the 10th floor under the 1/2 dummy coding suggested by the str() output. The discount could be due to sellers being more flexible with pricing when buyers rely on financing, or to mortgage-eligible properties being concentrated in areas with lower baseline prices. The positive interaction indicates that the discount is smaller for higher-floor apartments, possibly due to their desirability for investment purposes.

9.Conclusion

The study confirms that location, infrastructure, and building characteristics are major determinants of apartment prices in Baku. Several central metro areas and districts carry price premiums, while others sell at a discount. Higher floors are associated with lower prices, and the effect of new construction depends on apartment size rather than being a uniform discount. Mortgage availability is associated with slightly lower, not higher, prices, suggesting that other economic or market constraints may play a role. These results are conditional associations from a linear approximation and should not be read as causal effects. Even so, they provide useful insights for urban planners, real estate developers, and policymakers in optimizing housing policies and infrastructure planning.