The real estate market in Baku, Azerbaijan, represents a dynamic and evolving sector, shaped by rapid urban development, diverse economic conditions, and shifting demographic patterns. With the growing demand for housing in the city, understanding the factors that influence apartment prices is crucial for urban planners, investors, and policymakers. This project aims to explore these complexities by using econometric analysis to identify and quantify the key drivers of apartment prices in Baku.
Specifically, this research will address four key questions:
1. How do proximity to key amenities (such as metro stations) and infrastructure impact apartment prices in Baku? This question explores the correlation between the accessibility of essential amenities and property values, offering insights into urban planning and development strategies.
2. What is the effect of floor level on the pricing of apartments within high-rise buildings in Baku’s urban landscape? This will investigate whether higher floors in high-rise buildings command a premium, potentially due to better views, less noise, or other perceived benefits.
3. Are there significant price differences between newly constructed and older apartments in Baku after accounting for location and floor area? This question examines the impact of a building’s age on its market value, considering how newer constructions with modern amenities might influence pricing.
4. Does the availability of mortgages increase the market value of apartments in Baku, and if so, by how much compared to properties without mortgage options? This will explore whether the option to finance apartments through mortgages plays a role in increasing their market value in Baku.
Through this analysis, we seek to uncover the roles of location, amenities, economic conditions, and other relevant variables that contribute to the fluctuation of property values in Baku. By leveraging data sourced from Kaggle and scraping information from the local real estate platform Bina.az, the project will employ advanced statistical methods to generate actionable insights.
The outcome of this research will not only enhance our understanding of the Baku housing market but also provide critical policy recommendations for more informed urban planning and development strategies. Ultimately, this project aims to offer a comprehensive model that can guide future real estate investments and improve living conditions for residents in the city.
The variables are:
1. Price: This column indicates the listed price of the property, offering insights into market trends and pricing variations.
2. Location: The “Location” column specifies the geographical details of the property, such as the district, settlement, or nearest metro station. Location is a critical factor for real estate decision-making.
3. Rooms: This column represents the number of rooms in the property. Knowing the room count is crucial for prospective buyers or renters to assess the property’s suitability for their needs.
4. Square: The “Square” column contains the total area of the property in square meters. Property size is an essential factor for assessing space and value.
5. Floor: This column indicates the floor on which the property is situated, together with the total number of floors in the building (for example, “5/12”). For those interested in apartments, the floor number can be a critical factor.
6. New Building: The “New Building” column contains binary values (0 or 1) to indicate whether the property is in a newly constructed building. This information is valuable for those seeking modern or recently built properties.
7. Has Repair: This column contains binary values to indicate whether the property has undergone any repairs or renovations. Repair status can influence a buyer’s decision.
8. Has Bill of Sale: This column contains binary values to indicate whether a legal bill of sale exists for the property, ensuring the legitimacy of the transaction.
9. Has Mortgage: The “Has Mortgage” column contains binary values to indicate whether the property has an existing mortgage.
baku_apartment_data <- read.csv("C:/Users/User/Desktop/Applied Regional and Urban Economics/Final Project/BakuApartmentData.csv")
# View the first few rows of the dataset
head(baku_apartment_data)
## X price location rooms square floor new_building has_repair
## 1 0 284000 Azadlıq Prospekti m. 3 140 5/12 1 1
## 2 1 355000 Şah İsmayıl Xətai m. 3 135 19/20 1 1
## 3 2 755000 Səbail r. 4 210 7/18 1 1
## 4 3 245000 Elmlər Akademiyası m. 3 86 8/10 1 1
## 5 4 350000 Elmlər Akademiyası m. 4 174 12/15 1 1
## 6 5 255000 Nizami m. 2 93 10/16 1 1
## has_bill_of_sale has_mortgage
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
The data were scraped from Bina.az during 2023 and are imported above.
Number of columns and rows
ncol(baku_apartment_data)
## [1] 10
nrow(baku_apartment_data)
## [1] 39302
Structure
str(baku_apartment_data)
## 'data.frame': 39302 obs. of 10 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ price : int 284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
## $ location : chr "Azadlıq Prospekti m." "Şah İsmayıl Xətai m." "Səbail r." "Elmlər Akademiyası m." ...
## $ rooms : int 3 3 4 3 4 2 3 3 2 3 ...
## $ square : num 140 135 210 86 174 93 133 130 63 123 ...
## $ floor : chr "5/12" "19/20" "7/18" "8/10" ...
## $ new_building : int 1 1 1 1 1 1 1 1 1 1 ...
## $ has_repair : int 1 1 1 1 1 1 1 1 1 1 ...
## $ has_bill_of_sale: int 1 1 1 1 1 1 1 1 1 1 ...
## $ has_mortgage : int 1 1 1 1 1 1 1 1 1 1 ...
Summary
summary(baku_apartment_data)
## X price location rooms
## Min. : 0 Min. : 9600 Length:39302 Min. : 1.000
## 1st Qu.: 9825 1st Qu.: 135000 Class :character 1st Qu.: 2.000
## Median :19651 Median : 187000 Mode :character Median : 3.000
## Mean :19651 Mean : 232232 Mean : 2.814
## 3rd Qu.:29476 3rd Qu.: 277000 3rd Qu.: 3.000
## Max. :39301 Max. :8075000 Max. :20.000
## square floor new_building has_repair
## Min. : 12 Length:39302 Min. :0.0000 Min. :0.000
## 1st Qu.: 65 Class :character 1st Qu.:1.0000 1st Qu.:1.000
## Median : 94 Mode :character Median :1.0000 Median :1.000
## Mean : 106 Mean :0.7559 Mean :0.839
## 3rd Qu.: 130 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1600 Max. :1.0000 Max. :1.000
## has_bill_of_sale has_mortgage
## Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000
## Mean :0.7683 Mean :0.3379
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
Creating two columns from the floor column
# dplyr provides %>%, select() and rename(), which are used below
library(dplyr)
# Split the 'floor' column ("current/total") into its two parts
baku_apartment_data$floor_split <- strsplit(as.character(baku_apartment_data$floor), "/")
# Create 'current_floor' and 'total_floors' columns from the split data
baku_apartment_data$current_floor <- sapply(baku_apartment_data$floor_split, function(x) x[1])
baku_apartment_data$total_floors <- sapply(baku_apartment_data$floor_split, function(x) x[2])
# Convert the new columns to integer type
baku_apartment_data$current_floor <- as.integer(baku_apartment_data$current_floor)
baku_apartment_data$total_floors <- as.integer(baku_apartment_data$total_floors)
# Drop the original columns and keep the current floor under the name 'floor'
baku_apartment_data <- baku_apartment_data %>%
  select(-floor, -floor_split) %>%
  rename(floor = current_floor)
# View the updated dataframe to check the new columns
head(baku_apartment_data)
## X price location rooms square new_building has_repair
## 1 0 284000 Azadlıq Prospekti m. 3 140 1 1
## 2 1 355000 Şah İsmayıl Xətai m. 3 135 1 1
## 3 2 755000 Səbail r. 4 210 1 1
## 4 3 245000 Elmlər Akademiyası m. 3 86 1 1
## 5 4 350000 Elmlər Akademiyası m. 4 174 1 1
## 6 5 255000 Nizami m. 2 93 1 1
## has_bill_of_sale has_mortgage floor total_floors
## 1 1 1 5 12
## 2 1 1 19 20
## 3 1 1 7 18
## 4 1 1 8 10
## 5 1 1 12 15
## 6 1 1 10 16
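The split above relies on every value having the form "current/total". As a minimal sketch of how that assumption could be checked, the following base R snippet parses a small toy vector (the malformed entries are hypothetical, not taken from the dataset) and flags rows that do not parse cleanly or where the floor exceeds the building height.

```r
# Toy values in the same "current/total" format as the floor column;
# the last two entries are deliberately malformed (hypothetical cases).
floor_raw <- c("5/12", "19/20", "7/18", "3", "")

# Split once, then pick each piece by position; a missing piece becomes NA.
parts <- strsplit(floor_raw, "/", fixed = TRUE)
current_floor <- as.integer(vapply(parts, function(p) p[1], character(1)))
total_floors  <- as.integer(vapply(parts, function(p) p[2], character(1)))

parsed <- data.frame(floor_raw, current_floor, total_floors)
print(parsed)

# Rows that did not parse cleanly, or where the floor exceeds the building height
suspect <- is.na(current_floor) | is.na(total_floors) | current_floor > total_floors
print(parsed[suspect, ])
```

In the real data the missing-value counts reported later are all zero, so a check like this would mainly serve as a safeguard when the scrape is refreshed.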
# Load the required library
library(dplyr)
# Calculate the number of unique values for each column in the dataset
unique_counts <- sapply(baku_apartment_data, function(x) length(unique(x)))
# Print the number of unique values for each column
print(unique_counts)
## X price location rooms
## 39302 1845 111 16
## square new_building has_repair has_bill_of_sale
## 1100 2 2 2
## has_mortgage floor total_floors
## 2 27 33
# Converting binary categorical variables to factors
baku_apartment_data$new_building <- factor(baku_apartment_data$new_building)
baku_apartment_data$has_repair <- factor(baku_apartment_data$has_repair)
baku_apartment_data$has_bill_of_sale <- factor(baku_apartment_data$has_bill_of_sale)
baku_apartment_data$has_mortgage <- factor(baku_apartment_data$has_mortgage)
# Optionally converting location if treated as categorical
baku_apartment_data$location <- factor(baku_apartment_data$location)
baku_apartment_data <- baku_apartment_data %>%
select(-X)
# Checking the structure to verify changes
str(baku_apartment_data)
## 'data.frame': 39302 obs. of 10 variables:
## $ price : int 284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
## $ location : Factor w/ 111 levels "1-ci mikrorayon q.",..: 18 86 91 38 38 73 16 48 80 43 ...
## $ rooms : int 3 3 4 3 4 2 3 3 2 3 ...
## $ square : num 140 135 210 86 174 93 133 130 63 123 ...
## $ new_building : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_repair : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_bill_of_sale: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_mortgage : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ floor : int 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : int 12 20 18 10 15 16 6 18 16 17 ...
The region column is created using information from the location variable. If the location name ends with “m.”, the apartment is close to a metro station; if it ends with “q.”, it is in a suburban area; and if it ends with “r.”, the name refers to a district.
# Load necessary library
library(dplyr)
# Define a function to categorize locations
categorize_location <- function(location) {
  if (grepl("m\\.$", location)) {
    "Close to Metro"
  } else if (grepl("q\\.$", location)) {
    "Suburban Area"
  } else if (grepl("r\\.$", location)) {
    "Residential District"
  } else {
    # Fallback for names with none of the three suffixes
    "Central Baku"
  }
}
# Apply the function to create a new column
baku_apartment_data$region <- sapply(baku_apartment_data$location, categorize_location)
baku_apartment_data$region <- factor(baku_apartment_data$region)
# Check the changes
table(baku_apartment_data$region) # This will give you a summary count of each category
##
## Close to Metro Residential District Suburban Area
## 26825 6364 6113
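The same suffix rule can also be written without a per-row function call. The sketch below is an alternative formulation, not the code used in this report: it applies base R's vectorized `endsWith()` to a few toy location names, and the fallback label "Unclassified" is a placeholder introduced here for illustration.

```r
# Toy location names following the suffix convention described above
loc <- c("Nizami m.", "Yasamal r.", "Badamdar q.", "No suffix")

# endsWith() is vectorized, so the whole column is classified in one pass
region <- ifelse(endsWith(loc, "m."), "Close to Metro",
          ifelse(endsWith(loc, "q."), "Suburban Area",
          ifelse(endsWith(loc, "r."), "Residential District",
                 "Unclassified")))  # placeholder label (assumption)

region <- factor(region)
print(table(region))
```

On a factor column such as `location`, the input would first need `as.character()`, since `endsWith()` expects character vectors.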
# Extract unique locations from the 'location' column
unique_locations <- unique(baku_apartment_data$location)
# Print the unique location names
print(unique_locations)
## [1] Azadlıq Prospekti m. Şah İsmayıl Xətai m. Səbail r.
## [4] Elmlər Akademiyası m. Nizami m. Ağ şəhər q.
## [7] İnşaatçılar m. Qara Qarayev m. Həzi Aslanov m.
## [10] Yasamal r. Əhmədli m. Koroğlu m.
## [13] Memar Əcəmi m. Gənclik m. Nərimanov r.
## [16] Yeni Yasamal q. 8 Noyabr m. 20 Yanvar m.
## [19] Nəsimi m. 7-ci mikrorayon q. 28 May m.
## [22] Binəqədi r. Bayıl q. Avtovağzal m.
## [25] Nəriman Nərimanov m. Xətai r. Nəsimi r.
## [28] Binəqədi q. 9-cu mikrorayon q. Əhmədli q.
## [31] Bakıxanov q. Qaraçuxur q. Neftçilər m.
## [34] Badamdar q. Sabunçu r. Həzi Aslanov q.
## [37] Xalqlar Dostluğu m. Suraxanı r. Nizami r.
## [40] Hövsan q. Masazır q. 8-ci mikrorayon q.
## [43] 4-cü mikrorayon q. İçəri Şəhər m. Yeni Günəşli q.
## [46] Yasamal q. Dərnəgül m. 1-ci mikrorayon q.
## [49] Məmmədli q. Kubinka q. Nardaran q.
## [52] Mehdiabad q. Lökbatan q. Sahil m.
## [55] Biləcəri q. Köhnə Günəşli q. Kürdəxanı q.
## [58] Bakmil m. 8-ci kilometr q. Yeni Ramana q.
## [61] Ceyranbatan q. Abşeron r. Zığ q.
## [64] Buzovna q. Çiçək q. Massiv D q.
## [67] Ramana q. Günəşli q. Sahil q.
## [70] Zabrat q. Massiv A q. Saray q.
## [73] Sulutəpə q. 28 May q. M.Ə.Rəsulzadə q.
## [76] Xəzər r. Xocəsən m. Ulduz m.
## [79] Şıxov q. Xutor q. Massiv V q.
## [82] 6-cı mikrorayon q. Novxanı q. NZS q.
## [85] Sabunçu q. Maştağa q. Qaradağ r.
## [88] Böyükşor q. Bibi Heybət q. Qala q.
## [91] 3-cü mikrorayon q. Mərdəkan q. Hökməli q.
## [94] Savalan q. Digah q. Xocəsən q.
## [97] Binə q. Keşlə q. 5-ci mikrorayon q.
## [100] Görədil q. Pirəkəşkül q. Massiv G q.
## [103] Əmircan q. Suraxanı q. Massiv B q.
## [106] Yeni Suraxanı q. 20-ci sahə q. Ələt q.
## [109] Pirallahı r. Bülbülə q. Şimal DRES q.
## 111 Levels: 1-ci mikrorayon q. 20-ci sahə q. 20 Yanvar m. ... Zığ q.
library(ggplot2)
# Create a histogram of the price variable with adjusted x-axis limits
ggplot(baku_apartment_data, aes(x = price)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(title = "Histogram of Apartment Prices",
x = "Price",
y = "Frequency") +
xlim(0, 2000000) + # Adjust these limits based on your observation of where the data thins out
theme_minimal()
## Warning: Removed 47 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
The histogram shows the distribution of apartment prices in Baku, with
the majority of properties falling into lower price ranges, under
500,000. There is a noticeable frequency peak at the lower end,
indicating that more affordable apartments dominate the market. As the
price increases, the frequency of occurrences drops significantly,
showing that higher-priced apartments are much less common. The graph
also suggests the presence of outliers with very high apartment prices,
though these instances are rare.
ggplot(baku_apartment_data, aes(y = price)) +
geom_boxplot(fill = "tomato", color = "black") +
labs(title = "Boxplot of Apartment Prices", y = "Price") +
coord_cartesian(ylim = c(0, 2000000)) + # Adjusts view but does not remove data
theme_minimal()
Taken together, the histogram and boxplot show that the bulk of apartments are priced under 500,000. The distribution is right-skewed: most apartments sit in the lower price range, while a small number of much higher-priced listings, likely luxury accommodations or properties in prime locations, appear as outliers above the upper whisker of the boxplot. These outliers pull the mean price (232,232) well above the median (187,000) and point to a niche segment of the market that may be of interest to investors. The bulk of the market, however, consists of lower- and mid-priced housing.
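Right-skewed prices are often analyzed on a logarithmic scale. The following sketch illustrates why, using simulated log-normal values rather than the Bina.az data; the `skewness()` helper and the simulation parameters are assumptions made for this example only.

```r
set.seed(1)
# Simulated right-skewed "prices" (log-normal); NOT the Bina.az data
price_sim <- rlnorm(5000, meanlog = 12, sdlog = 0.6)

# Moment-based sample skewness (hypothetical helper for this sketch)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(price_sim)
skew_log <- skewness(log(price_sim))
cat("skewness of price:     ", round(skew_raw, 2), "\n")
cat("skewness of log(price):", round(skew_log, 2), "\n")
```

If the real price column behaves similarly, `log(price)` may be a more convenient dependent variable for the later regression, though that choice should be checked against the actual data.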
# Count missing values per column
missing_counts <- sapply(baku_apartment_data, function(x) sum(is.na(x)))
print(missing_counts)
## price location rooms square
## 0 0 0 0
## new_building has_repair has_bill_of_sale has_mortgage
## 0 0 0 0
## floor total_floors region
## 0 0 0
# Install and load VIM package if not already installed
if (!require(VIM)) install.packages("VIM", dependencies = TRUE)
## Loading required package: VIM
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(VIM)
# Visualize missing data
aggr_plot <- aggr(baku_apartment_data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,
labels=names(baku_apartment_data), cex.axis=.7, gap=3,
ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## price 0
## location 0
## rooms 0
## square 0
## new_building 0
## has_repair 0
## has_bill_of_sale 0
## has_mortgage 0
## floor 0
## total_floors 0
## region 0
This plot of missing data summarizes the completeness of the dataset. Every variable, including “price,” “location,” “rooms,” “square,” and “has_mortgage,” has a missing-value count of zero, as both the per-column counts and the sorted table above confirm. No imputation or row removal is therefore needed before the analysis.
summary(baku_apartment_data)
## price location rooms square
## Min. : 9600 İnşaatçılar m. : 2834 Min. : 1.000 Min. : 12
## 1st Qu.: 135000 Nəriman Nərimanov m.: 2633 1st Qu.: 2.000 1st Qu.: 65
## Median : 187000 Nəsimi r. : 2254 Median : 3.000 Median : 94
## Mean : 232232 Şah İsmayıl Xətai m.: 2171 Mean : 2.814 Mean : 106
## 3rd Qu.: 277000 Həzi Aslanov m. : 2113 3rd Qu.: 3.000 3rd Qu.: 130
## Max. :8075000 Memar Əcəmi m. : 1981 Max. :20.000 Max. :1600
## (Other) :25316
## new_building has_repair has_bill_of_sale has_mortgage floor
## 0: 9594 0: 6327 0: 9108 0:26020 Min. : 1.000
## 1:29708 1:32975 1:30194 1:13282 1st Qu.: 4.000
## Median : 7.000
## Mean : 8.198
## 3rd Qu.:12.000
## Max. :27.000
##
## total_floors region
## Min. : 1.00 Close to Metro :26825
## 1st Qu.: 9.00 Residential District: 6364
## Median :16.00 Suburban Area : 6113
## Mean :13.86
## 3rd Qu.:18.00
## Max. :34.00
##
# Assuming 'baku_apartment_data' is your dataframe
# Select only numeric variables
numeric_data <- baku_apartment_data[sapply(baku_apartment_data, is.numeric)]
# Compute the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs") # Handling missing values by excluding them
print(cor_matrix)
## price rooms square floor total_floors
## price 1.0000000 0.5986930 0.8081653 0.1586334 0.2610726
## rooms 0.5986930 1.0000000 0.7711708 0.1110677 0.1226077
## square 0.8081653 0.7711708 1.0000000 0.2340331 0.3263950
## floor 0.1586334 0.1110677 0.2340331 1.0000000 0.6244251
## total_floors 0.2610726 0.1226077 0.3263950 0.6244251 1.0000000
# Install and load the corrplot package if not already installed
if (!require(corrplot)) install.packages("corrplot")
## Loading required package: corrplot
## corrplot 0.94 loaded
library(corrplot)
# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, cl.cex = 0.8)
The correlation matrix shows a strong positive correlation (0.77) between the area of the apartment in square meters and its number of rooms, which might cause a multicollinearity problem in the model. At this stage we keep both variables; after fitting the model we will check multicollinearity again using variance inflation factor (VIF) scores and make the final decision then.
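As a rough illustration of the VIF check mentioned above, the sketch below computes VIF scores by hand in base R, using the definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The data are simulated so that `rooms` correlates with `square`; the variable `floor_no` and all numbers are assumptions for this example. In practice a packaged function such as `vif()` from the car package would normally be applied to the fitted model.

```r
set.seed(42)
n <- 1000
# Simulated predictors (NOT the real data): rooms is built to track square
square   <- runif(n, 30, 200)
rooms    <- pmax(1, round(square / 35 + rnorm(n, sd = 0.7)))
floor_no <- sample(1:20, n, replace = TRUE)
X <- data.frame(square, rooms, floor_no)

# VIF_j = 1 / (1 - R^2_j): regress each predictor on all the others
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, but the threshold is a convention rather than a formal test.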
library(ggplot2)
# Identify numeric columns
numeric_columns <- sapply(baku_apartment_data, is.numeric)
# Loop through numeric columns to plot histograms
for (col_name in names(baku_apartment_data)[numeric_columns]) {
  # .data[[col_name]] replaces the deprecated aes_string()
  print(ggplot(baku_apartment_data, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30, fill = "blue", color = "black") +
    ggtitle(paste("Histogram of", col_name)) +
    xlab(col_name) +
    ylab("Frequency") +
    theme_minimal())
}
Among these graphs, the histogram of the “total_floors” variable is worth a closer look. It shows several peaks at specific floor counts, indicating that buildings of certain heights are especially common in the market. One group of frequent values lies between 5 and 10 floors, but the summary statistics above (median 16, third quartile 18) show that at least half of the listed apartments are in buildings with 16 or more floors. Few apartments are in buildings with very low or very high floor counts, as indicated by the low frequencies in these ranges.
# Identify categorical columns
categorical_columns <- sapply(baku_apartment_data, function(x) is.factor(x) || is.character(x))
# Loop through categorical columns to plot bar plots
for (col_name in names(baku_apartment_data)[categorical_columns]) {
  # .data[[col_name]] replaces the deprecated aes_string()
  print(ggplot(baku_apartment_data, aes(x = .data[[col_name]])) +
    geom_bar(fill = "tomato", color = "black") +
    ggtitle(paste("Bar Plot of", col_name)) +
    xlab(col_name) +
    ylab("Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))) # Rotate x-axis labels for readability
}
This bar plot shows the distribution of apartment listings across the three region categories. Most listings are in areas Close to Metro, which account for 26,825 of the 39,302 properties (about 68%). The Residential District and Suburban Area categories have far fewer listings, 6,364 and 6,113 respectively. This suggests that the listed market is heavily concentrated around metro-accessible locations, which may reflect the high demand for, and convenience of, living near public transportation.
library(dplyr)
library(ggplot2)
# Count the frequency of each location, sort it, and select the top 10
top_locations <- baku_apartment_data %>%
count(location) %>%
arrange(desc(n)) %>%
top_n(10, n)
# View the top locations
print(top_locations)
## location n
## 1 İnşaatçılar m. 2834
## 2 Nəriman Nərimanov m. 2633
## 3 Nəsimi r. 2254
## 4 Şah İsmayıl Xətai m. 2171
## 5 Həzi Aslanov m. 2113
## 6 Memar Əcəmi m. 1981
## 7 28 May m. 1696
## 8 Elmlər Akademiyası m. 1675
## 9 Nərimanov r. 1587
## 10 8 Noyabr m. 1408
# Filter the main dataset to include only the top 10 locations
filtered_data <- baku_apartment_data %>%
filter(location %in% top_locations$location)
# Plotting the bar plot for the top 10 locations
ggplot(filtered_data, aes(x = location)) +
geom_bar(fill = "tomato", color = "black") +
ggtitle("Top 10 Most Frequent Locations") +
xlab("Location") +
ylab("Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
This bar plot illustrates the 10 most frequent apartment locations in Baku. İnşaatçılar m. (2,834 listings), Nəriman Nərimanov m. (2,633), and Nəsimi r. (2,254) have the highest number of listings, followed by Şah İsmayıl Xətai m., Həzi Aslanov m., and Memar Əcəmi m. The remaining four locations, 28 May m., Elmlər Akademiyası m., Nərimanov r., and 8 Noyabr m., each have between roughly 1,400 and 1,700 listings. Eight of the ten locations are metro stations, which underscores the concentration of listings in well-connected urban areas with metro access.
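Bar charts of a factor are drawn in level order, which is alphabetical by default, so the tallest bar is not necessarily the first one. A minimal sketch of re-leveling a factor by frequency is shown below on a toy vector; the counts are hypothetical and only the location names come from the dataset.

```r
# Toy location vector (hypothetical counts), standing in for the location column
loc <- factor(c(rep("28 May m.", 3), rep("Nizami m.", 5), rep("Yasamal r.", 2)))

# Frequency table sorted from most to least common
counts <- sort(table(loc), decreasing = TRUE)
print(counts)

# Re-level the factor by frequency so that plots follow the same order
loc_ordered <- factor(loc, levels = names(counts))
print(levels(loc_ordered))
```

Passing a factor ordered this way to `ggplot()` would place the most frequent location first on the x-axis, which makes the ranking in the plot match the printed table.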
library(dplyr)
library(ggplot2)
# Recategorize the 'rooms' variable
baku_apartment_data_room <- baku_apartment_data %>%
mutate(rooms_cat = case_when(
rooms == 1 ~ "1",
rooms == 2 ~ "2",
rooms == 3 ~ "3",
rooms == 4 ~ "4",
rooms >= 5 ~ "5+", # Five or more rooms share one category
TRUE ~ as.character(rooms) # Handles any data integrity issues
))
# Convert rooms_cat to a factor for better plotting
baku_apartment_data_room$rooms_cat <- as.factor(baku_apartment_data_room$rooms_cat)
library(ggplot2)
library(scales) # Ensure scales is loaded for formatting
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Plot with black background and adjusted color settings
ggplot(baku_apartment_data_room, aes(x = square, y = price, color = rooms_cat)) +
geom_point(alpha = 0.6) +
scale_color_viridis_d(end = 0.9) + # Adjust color scale for visibility on dark background
scale_x_continuous(breaks = seq(0, 500, by = 50), # Adjust x-axis by 50 units
limits = c(0, 500), # Limit x-axis to 500
labels = comma) +
scale_y_continuous(breaks = seq(0, 1e6, by = 100000), # Adjust y-axis by 100k units
limits = c(0, 1e6), # Limit y-axis to 1 million
labels = comma) +
labs(title = "Price vs. Area by Number of Rooms",
x = "Area (square meters)",
y = "Price",
color = "Number of Rooms") +
theme_minimal(base_family = "Helvetica", base_size = 14) +
theme(
plot.background = element_rect(fill = "black", color = "black"),
panel.background = element_rect(fill = "black"),
text = element_text(color = "white"),
axis.title = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.ticks = element_line(color = "white"),
legend.background = element_rect(fill = "black", color = "black"),
legend.title = element_text(color = "white"),
legend.text = element_text(color = "white"),
plot.title = element_text(hjust = 0.5, color = "white", size = 16)
)
## Warning: Removed 254 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
## not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
This scatter plot shows the relationship between apartment price, area in square meters, and number of rooms. As the area increases, prices tend to rise as well. The points are colored by room category (1, 2, 3, 4, and 5+ rooms). Apartments with more rooms generally have higher prices, especially those with larger areas.
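Because price and area move together, a per-square-meter price is a convenient way to compare room categories on a common scale. The sketch below shows the arithmetic on a few toy listings; the numbers are hypothetical, and only the column names `price`, `square`, and `rooms` follow the dataset.

```r
# Toy listings (hypothetical numbers), using the column names from the dataset
toy <- data.frame(
  price  = c(120000, 150000, 240000, 300000, 520000),
  square = c(50, 60, 100, 120, 200),
  rooms  = c(1, 2, 3, 3, 5)
)

# Price per square meter for each listing
toy$price_per_m2 <- toy$price / toy$square

# Room categories: 1, 2, 3, 4, and 5 or more
toy$rooms_cat <- cut(toy$rooms, breaks = c(0, 1, 2, 3, 4, Inf),
                     labels = c("1", "2", "3", "4", "5+"))

# Median price per square meter within each room category
print(aggregate(price_per_m2 ~ rooms_cat, data = toy, FUN = median))
```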
library(dplyr)
# Recategorize the 'floor' variable
baku_apartment_data_floor <- baku_apartment_data %>%
mutate(floor_cat = case_when(
floor <= 20 ~ as.character(floor),
floor > 20 ~ "21+" # Floors 21 and above share one category
))
# Convert floor_cat to a factor with levels in numeric order for plotting
baku_apartment_data_floor$floor_cat <- factor(baku_apartment_data_floor$floor_cat, levels = c(as.character(1:20), "21+"))
library(ggplot2)
library(scales) # For formatting numbers
# Plotting the price vs. categorized floor with adjustments
ggplot(baku_apartment_data_floor, aes(x = floor_cat, y = price)) +
geom_boxplot(aes(fill = floor_cat), outlier.color = "red", outlier.size = 1.5) +
scale_y_continuous(breaks = seq(0, 1e6, by = 100000), # Adjust y-axis by 100k units
limits = c(0, 1e6), # Limit y-axis to 1 million
labels = comma) +
scale_fill_viridis_d(option = "A") + # Using a discrete color scale
labs(title = "Price Distribution by Floor",
x = "Floor Number",
y = "Price") +
theme_minimal(base_family = "Helvetica", base_size = 14) +
theme(
plot.background = element_rect(fill = "black", color = "black"),
panel.background = element_rect(fill = "black"),
text = element_text(color = "white"),
axis.title = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.ticks = element_line(color = "white"),
legend.background = element_rect(fill = "black", color = "black"),
legend.title = element_text(color = "white"),
legend.text = element_text(color = "white"),
plot.title = element_text(hjust = 0.5, color = "white", size = 16)
)
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
This boxplot visualizes the price distribution across floor categories. Apartments on higher floors (e.g., floors 12 and above) generally have higher median prices than those on lower floors, which is consistent with the positive but weak correlation between floor and price (0.16) reported in the correlation matrix. The whiskers show the spread of prices, and the outliers indicate that a few apartments have prices well above the median for their floor. Note that the y-axis limit of 1,000,000 removes 245 listings from the plot, as the warning above reports.
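Setting `limits` on a continuous scale discards out-of-range rows before the boxplot statistics are computed, whereas `coord_cartesian()` only zooms the view, as the earlier boxplot of prices notes in its comment. The sketch below imitates both behaviors in base R on simulated prices (not the real data) to show how dropping the rows can shift the quartiles.

```r
set.seed(7)
# Simulated right-skewed prices (NOT the real data)
price_sim <- rlnorm(10000, meanlog = 12.2, sdlog = 0.7)
cap <- 1e6

# Quartiles when rows above the cap are removed first
# (the effect of `limits = c(0, 1e6)` on a continuous scale)
q_dropped <- quantile(price_sim[price_sim <= cap], c(0.25, 0.5, 0.75))

# Quartiles when all rows are kept and only the view is zoomed
# (the effect of `coord_cartesian(ylim = c(0, 1e6))`)
q_full <- quantile(price_sim, c(0.25, 0.5, 0.75))

print(rbind(dropped = q_dropped, full = q_full))
```

With only a small share of rows above the cap the shift is modest, but it grows for groups that contain many expensive listings, so zooming with `coord_cartesian()` is the safer choice when the boxes themselves are being interpreted.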
library(ggplot2)
library(scales) # For formatting numbers
# Plotting the price vs. regions
ggplot(baku_apartment_data, aes(x = region, y = price)) +
geom_violin(aes(fill = region), trim = FALSE) +
geom_boxplot(width = 0.1, outlier.color = "red", outlier.size = 1.5, color = "white") +
scale_y_continuous(breaks = seq(0, 1e6, by = 100000), # Adjust y-axis by 100k units
limits = c(0, 1e6), # Limit y-axis to 1 million
labels = comma) +
scale_fill_viridis_d(option = "A") + # Using a discrete color scale
labs(title = "Price Distribution by Region",
x = "Region",
y = "Price") +
theme_minimal(base_family = "Helvetica", base_size = 14) +
theme(
plot.background = element_rect(fill = "black", color = "black"),
panel.background = element_rect(fill = "black"),
text = element_text(color = "white"),
axis.title = element_text(color = "white"),
axis.text = element_text(color = "white"),
axis.ticks = element_line(color = "white"),
legend.background = element_rect(fill = "black", color = "black"),
legend.title = element_text(color = "white"),
legend.text = element_text(color = "white"),
plot.title = element_text(hjust = 0.5, color = "white", size = 16)
)
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 245 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_violin()`).
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
The violin plot shows the price distribution of apartments in three regions: Close to Metro, Residential District, and Suburban Area. Apartments close to metro stations tend to have higher prices, as indicated by their higher median and wider distribution. Residential districts show a narrower range of prices, while suburban areas have the lowest prices, reflected in a more compressed distribution of values.
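The visual comparison above can be backed by a numeric summary per region. The following is a minimal sketch in base R; the data frame `toy` and its prices are synthetic and purely illustrative, not values from the Baku dataset.

```r
# Synthetic, illustrative data only (not the Baku dataset)
toy <- data.frame(
  region = factor(c("Close to Metro", "Close to Metro", "Close to Metro",
                    "Residential District", "Residential District", "Residential District",
                    "Suburban Area", "Suburban Area", "Suburban Area")),
  price = c(250000, 300000, 410000, 180000, 200000, 230000, 90000, 120000, 150000)
)

# Median, interquartile range, and count per region
region_summary <- do.call(rbind, lapply(split(toy$price, toy$region), function(p) {
  data.frame(median = median(p), iqr = IQR(p), n = length(p))
}))
print(region_summary)
```

Applied to the real data, the same call would use `baku_apartment_data$price` and `baku_apartment_data$region` in place of the toy columns.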
colnames(baku_apartment_data)
## [1] "price" "location" "rooms" "square"
## [5] "new_building" "has_repair" "has_bill_of_sale" "has_mortgage"
## [9] "floor" "total_floors" "region"
str(baku_apartment_data)
## 'data.frame': 39302 obs. of 11 variables:
## $ price : int 284000 355000 755000 245000 350000 255000 410000 235000 125000 207000 ...
## $ location : Factor w/ 111 levels "1-ci mikrorayon q.",..: 18 86 91 38 38 73 16 48 80 43 ...
## $ rooms : int 3 3 4 3 4 2 3 3 2 3 ...
## $ square : num 140 135 210 86 174 93 133 130 63 123 ...
## $ new_building : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_repair : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_bill_of_sale: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_mortgage : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ floor : int 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : int 12 20 18 10 15 16 6 18 16 17 ...
## $ region : Factor w/ 3 levels "Close to Metro",..: 1 1 2 1 1 1 3 1 1 1 ...
# Check for any zeros or negative values before applying log transformation
sum(baku_apartment_data$price <= 0)
## [1] 0
sum(baku_apartment_data$square <= 0)
## [1] 0
# No zero or negative values were found, so a plain log transformation is valid
baku_apartment_data$log_price <- log(baku_apartment_data$price)
baku_apartment_data$log_square <- log(baku_apartment_data$square)
# Note: the next two lines run unconditionally and overwrite the values above
# with log(x + 1), a shift normally reserved for data containing zeros.
# All results below are based on this log(x + 1) version.
baku_apartment_data$log_price <- log(baku_apartment_data$price + 1)
baku_apartment_data$log_square <- log(baku_apartment_data$square + 1)
We check the log-transformed variables and then store them in a new data frame.
summary(baku_apartment_data$log_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.17 11.81 12.14 12.19 12.53 15.90
summary(baku_apartment_data$log_square)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.565 4.190 4.554 4.559 4.875 7.378
# Remove the original 'price' and 'square' columns
baku_apartment_data_logged <- baku_apartment_data %>%
select(-price, -square)
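Because the code applies both `log(x)` and `log(x + 1)`, it is worth checking by hand which version a stored value corresponds to. The sketch below uses the first listing from the `str()` output above (price 284000, area 140); the variable names are introduced here for illustration only.

```r
# Values taken from the first row of the dataset shown above
price_1 <- 284000
square_1 <- 140

plain_log <- log(square_1)        # log(140)
shifted_log <- log(square_1 + 1)  # log(141), identical to log1p(140)

round(plain_log, 6)
round(shifted_log, 6)

# The +1 shift matters far more for small values (area) than for large ones (price)
log(square_1 + 1) - log(square_1)
log(price_1 + 1) - log(price_1)
```

For the area variable the two versions differ in the third decimal place, so the choice slightly changes the estimated elasticity; for price the difference is negligible.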
Adding 5 Interaction and 2 Polynomial Features
library(dplyr)
# Add interaction and polynomial features to the log-transformed dataset
baku_apartment_data_logged <- baku_apartment_data_logged %>%
mutate(
# Interaction features
interaction_log_square_floor = log_square * floor,
# Note: as.numeric() on a factor returns level codes (1 for "0", 2 for "1"), not 0/1
interaction_log_square_new_building = log_square * as.numeric(new_building),
interaction_floor_has_mortgage = floor * as.numeric(has_mortgage), # same 1/2 coding
interaction_log_square_rooms = log_square * rooms,
interaction_floor_rooms = floor * rooms,
# Polynomial features
log_square_squared = log_square^2,
floor_cubed = floor^3
)
# Check the first few rows to confirm additions
head(baku_apartment_data_logged)
## location rooms new_building has_repair has_bill_of_sale
## 1 Azadlıq Prospekti m. 3 1 1 1
## 2 Şah İsmayıl Xətai m. 3 1 1 1
## 3 Səbail r. 4 1 1 1
## 4 Elmlər Akademiyası m. 3 1 1 1
## 5 Elmlər Akademiyası m. 4 1 1 1
## 6 Nizami m. 2 1 1 1
## has_mortgage floor total_floors region log_price log_square
## 1 1 5 12 Close to Metro 12.55673 4.948760
## 2 1 19 20 Close to Metro 12.77988 4.912655
## 3 1 7 18 Residential District 13.53447 5.351858
## 4 1 8 10 Close to Metro 12.40902 4.465908
## 5 1 12 15 Close to Metro 12.76569 5.164786
## 6 1 10 16 Close to Metro 12.44902 4.543295
## interaction_log_square_floor interaction_log_square_new_building
## 1 24.74380 9.897520
## 2 93.34044 9.825310
## 3 37.46301 10.703716
## 4 35.72726 8.931816
## 5 61.97743 10.329572
## 6 45.43295 9.086590
## interaction_floor_has_mortgage interaction_log_square_rooms
## 1 10 14.84628
## 2 38 14.73796
## 3 14 21.40743
## 4 16 13.39772
## 5 24 20.65914
## 6 20 9.08659
## interaction_floor_rooms log_square_squared floor_cubed
## 1 15 24.49022 125
## 2 57 24.13418 6859
## 3 28 28.64239 343
## 4 24 19.94434 512
## 5 48 26.67501 1728
## 6 20 20.64153 1000
ncol(baku_apartment_data_logged)
## [1] 18
nrow(baku_apartment_data_logged)
## [1] 39302
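The interaction columns printed above are built with `as.numeric()` applied to factors. A short sketch of what that call returns may help when reading the values; the vector `flag` is a hypothetical stand-in for `new_building` or `has_mortgage`.

```r
# Hypothetical stand-in for a 0/1 factor such as new_building
flag <- factor(c("0", "1", "1"), levels = c("0", "1"))

codes <- as.numeric(flag)                  # internal level codes
values <- as.numeric(as.character(flag))   # the labels converted to numbers

codes
values

# First listing: area 140, new_building = "1"
log_square <- log(140 + 1)
log_square * codes[2]    # what the interaction column stores under level codes
log_square * values[2]   # what a 0/1 coding would store
```

Under level codes the interaction equals `log_square` for old buildings and `2 * log_square` for new ones, which matches the first printed value (9.897520 is twice 4.948760). The model remains estimable, but the coefficient on `log_square` then describes a hypothetical code of 0 rather than old buildings.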
library(dplyr)
# Convert factor levels in the dataset
baku_apartment_data_logged <- baku_apartment_data_logged %>%
mutate(
new_building = recode(new_building, `1` = "Yes", `0` = "No"),
has_repair = recode(has_repair, `1` = "Yes", `0` = "No"),
has_bill_of_sale = recode(has_bill_of_sale, `1` = "Yes", `0` = "No"),
has_mortgage = recode(has_mortgage, `1` = "Yes", `0` = "No")
)
# Check the updated factor levels
head(baku_apartment_data_logged)
## location rooms new_building has_repair has_bill_of_sale
## 1 Azadlıq Prospekti m. 3 Yes Yes Yes
## 2 Şah İsmayıl Xətai m. 3 Yes Yes Yes
## 3 Səbail r. 4 Yes Yes Yes
## 4 Elmlər Akademiyası m. 3 Yes Yes Yes
## 5 Elmlər Akademiyası m. 4 Yes Yes Yes
## 6 Nizami m. 2 Yes Yes Yes
## has_mortgage floor total_floors region log_price log_square
## 1 Yes 5 12 Close to Metro 12.55673 4.948760
## 2 Yes 19 20 Close to Metro 12.77988 4.912655
## 3 Yes 7 18 Residential District 13.53447 5.351858
## 4 Yes 8 10 Close to Metro 12.40902 4.465908
## 5 Yes 12 15 Close to Metro 12.76569 5.164786
## 6 Yes 10 16 Close to Metro 12.44902 4.543295
## interaction_log_square_floor interaction_log_square_new_building
## 1 24.74380 9.897520
## 2 93.34044 9.825310
## 3 37.46301 10.703716
## 4 35.72726 8.931816
## 5 61.97743 10.329572
## 6 45.43295 9.086590
## interaction_floor_has_mortgage interaction_log_square_rooms
## 1 10 14.84628
## 2 38 14.73796
## 3 14 21.40743
## 4 16 13.39772
## 5 24 20.65914
## 6 20 9.08659
## interaction_floor_rooms log_square_squared floor_cubed
## 1 15 24.49022 125
## 2 57 24.13418 6859
## 3 28 28.64239 343
## 4 24 19.94434 512
## 5 48 26.67501 1728
## 6 20 20.64153 1000
top_locations <- c("İnşaatçılar m.", "Nəriman Nərimanov m.", "Nəsimi r.",
"Şah İsmayıl Xətai m.", "Həzi Aslanov m.", "Memar Əcəmi m.",
"28 May m.", "Elmlər Akademiyası m.", "Nərimanov r.", "8 Noyabr m.")
library(dplyr)
baku_apartment_data_logged <- baku_apartment_data_logged %>%
mutate(location = if_else(location %in% top_locations, as.character(location), "Others"))
# Convert 'location' back to a factor if needed
baku_apartment_data_logged$location <- factor(baku_apartment_data_logged$location)
table(baku_apartment_data_logged$location)
##
## 28 May m. 8 Noyabr m. Elmlər Akademiyası m.
## 1696 1408 1675
## Həzi Aslanov m. İnşaatçılar m. Memar Əcəmi m.
## 2113 2834 1981
## Nəriman Nərimanov m. Nərimanov r. Nəsimi r.
## 2633 1587 2254
## Others Şah İsmayıl Xətai m.
## 18950 2171
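The ten retained locations are hard-coded above. If they were chosen by listing count (the source does not say so), the same grouping could be derived from the data. The helper `lump_top_n()` below is a hypothetical base R sketch, shown on toy labels.

```r
# Hypothetical helper: keep the n most frequent labels, lump the rest together
lump_top_n <- function(x, n, other = "Others") {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(as.character(x) %in% keep, as.character(x), other))
}

# Toy labels for illustration only
toy_location <- c("A", "A", "A", "B", "B", "C", "D")
lumped <- lump_top_n(toy_location, n = 2)
table(lumped)
```

Deriving the list from counts keeps the grouping reproducible if the data are refreshed, whereas a hard-coded vector has to be maintained by hand.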
head(baku_apartment_data_logged)
## location rooms new_building has_repair has_bill_of_sale
## 1 Others 3 Yes Yes Yes
## 2 Şah İsmayıl Xətai m. 3 Yes Yes Yes
## 3 Others 4 Yes Yes Yes
## 4 Elmlər Akademiyası m. 3 Yes Yes Yes
## 5 Elmlər Akademiyası m. 4 Yes Yes Yes
## 6 Others 2 Yes Yes Yes
## has_mortgage floor total_floors region log_price log_square
## 1 Yes 5 12 Close to Metro 12.55673 4.948760
## 2 Yes 19 20 Close to Metro 12.77988 4.912655
## 3 Yes 7 18 Residential District 13.53447 5.351858
## 4 Yes 8 10 Close to Metro 12.40902 4.465908
## 5 Yes 12 15 Close to Metro 12.76569 5.164786
## 6 Yes 10 16 Close to Metro 12.44902 4.543295
## interaction_log_square_floor interaction_log_square_new_building
## 1 24.74380 9.897520
## 2 93.34044 9.825310
## 3 37.46301 10.703716
## 4 35.72726 8.931816
## 5 61.97743 10.329572
## 6 45.43295 9.086590
## interaction_floor_has_mortgage interaction_log_square_rooms
## 1 10 14.84628
## 2 38 14.73796
## 3 14 21.40743
## 4 16 13.39772
## 5 24 20.65914
## 6 20 9.08659
## interaction_floor_rooms log_square_squared floor_cubed
## 1 15 24.49022 125
## 2 57 24.13418 6859
## 3 28 28.64239 343
## 4 24 19.94434 512
## 5 48 26.67501 1728
## 6 20 20.64153 1000
library(dplyr)
baku_apartment_data_logged <- baku_apartment_data_logged %>%
mutate(
log_price = as.numeric(log_price),
log_square = as.numeric(log_square),
interaction_log_square_floor = as.numeric(interaction_log_square_floor),
interaction_log_square_new_building = as.numeric(interaction_log_square_new_building),
interaction_floor_has_mortgage = as.numeric(interaction_floor_has_mortgage),
interaction_log_square_rooms = as.numeric(interaction_log_square_rooms),
interaction_floor_rooms = as.numeric(interaction_floor_rooms),
log_square_squared = as.numeric(log_square_squared),
floor_cubed = as.numeric(floor_cubed)
)
str(baku_apartment_data_logged)
## 'data.frame': 39302 obs. of 18 variables:
## $ location : Factor w/ 11 levels "28 May m.","8 Noyabr m.",..: 10 11 10 3 3 10 10 5 10 4 ...
## $ rooms : int 3 3 4 3 4 2 3 3 2 3 ...
## $ new_building : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_repair : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_bill_of_sale : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ has_mortgage : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ floor : int 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : int 12 20 18 10 15 16 6 18 16 17 ...
## $ region : Factor w/ 3 levels "Close to Metro",..: 1 1 2 1 1 1 3 1 1 1 ...
## $ log_price : num 12.6 12.8 13.5 12.4 12.8 ...
## $ log_square : num 4.95 4.91 5.35 4.47 5.16 ...
## $ interaction_log_square_floor : num 24.7 93.3 37.5 35.7 62 ...
## $ interaction_log_square_new_building: num 9.9 9.83 10.7 8.93 10.33 ...
## $ interaction_floor_has_mortgage : num 10 38 14 16 24 20 8 18 30 20 ...
## $ interaction_log_square_rooms : num 14.8 14.7 21.4 13.4 20.7 ...
## $ interaction_floor_rooms : num 15 57 28 24 48 20 12 27 30 30 ...
## $ log_square_squared : num 24.5 24.1 28.6 19.9 26.7 ...
## $ floor_cubed : num 125 6859 343 512 1728 ...
# One-hot encode the factors. Because "- 1" removes the intercept, the first factor
# (location) keeps all of its levels; the remaining factors drop their first level.
data_encoded <- model.matrix(~ . - 1, data = baku_apartment_data_logged) %>% as.data.frame()
# log_price is already part of the matrix; this reassignment only makes the target explicit
data_encoded$log_price <- baku_apartment_data_logged$log_price
# Verify the structure of the new encoded dataset
str(data_encoded)
## 'data.frame': 39302 obs. of 29 variables:
## $ location28 May m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ location8 Noyabr m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationElmlər Akademiyası m. : num 0 0 0 1 1 0 0 0 0 0 ...
## $ locationHəzi Aslanov m. : num 0 0 0 0 0 0 0 0 0 1 ...
## $ locationİnşaatçılar m. : num 0 0 0 0 0 0 0 1 0 0 ...
## $ locationMemar Əcəmi m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəriman Nərimanov m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNərimanov r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəsimi r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationOthers : num 1 0 1 0 0 1 1 0 1 0 ...
## $ locationŞah İsmayıl Xətai m. : num 0 1 0 0 0 0 0 0 0 0 ...
## $ rooms : num 3 3 4 3 4 2 3 3 2 3 ...
## $ new_buildingYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_repairYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_bill_of_saleYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_mortgageYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ floor : num 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : num 12 20 18 10 15 16 6 18 16 17 ...
## $ regionResidential District : num 0 0 1 0 0 0 0 0 0 0 ...
## $ regionSuburban Area : num 0 0 0 0 0 0 1 0 0 0 ...
## $ log_price : num 12.6 12.8 13.5 12.4 12.8 ...
## $ log_square : num 4.95 4.91 5.35 4.47 5.16 ...
## $ interaction_log_square_floor : num 24.7 93.3 37.5 35.7 62 ...
## $ interaction_log_square_new_building: num 9.9 9.83 10.7 8.93 10.33 ...
## $ interaction_floor_has_mortgage : num 10 38 14 16 24 20 8 18 30 20 ...
## $ interaction_log_square_rooms : num 14.8 14.7 21.4 13.4 20.7 ...
## $ interaction_floor_rooms : num 15 57 28 24 48 20 12 27 30 30 ...
## $ log_square_squared : num 24.5 24.1 28.6 19.9 26.7 ...
## $ floor_cubed : num 125 6859 343 512 1728 ...
ncol(data_encoded)
## [1] 29
nrow(data_encoded)
## [1] 39302
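The encoded data above contain 11 location dummies but only 2 region dummies for 3 region levels. This follows from how `model.matrix()` treats a formula without an intercept, which the toy example below illustrates; the data frame `toy` is hypothetical.

```r
# Toy data with two factors and one numeric column
toy <- data.frame(
  loc = factor(c("A", "B", "C", "A")),
  flag = factor(c("No", "Yes", "No", "Yes")),
  x = c(1, 2, 3, 4)
)

# Without an intercept, only the FIRST factor keeps all of its levels
no_intercept <- colnames(model.matrix(~ . - 1, data = toy))
# With an intercept, every factor drops its first level
with_intercept <- colnames(model.matrix(~ ., data = toy))

no_intercept
with_intercept
```

This is why one location column has to be dropped by hand before fitting a model that includes an intercept.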
head(data_encoded)
## location28 May m. location8 Noyabr m. locationElmlər Akademiyası m.
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 0
## locationHəzi Aslanov m. locationİnşaatçılar m. locationMemar Əcəmi m.
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## locationNəriman Nərimanov m. locationNərimanov r. locationNəsimi r.
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## locationOthers locationŞah İsmayıl Xətai m. rooms new_buildingYes
## 1 1 0 3 1
## 2 0 1 3 1
## 3 1 0 4 1
## 4 0 0 3 1
## 5 0 0 4 1
## 6 1 0 2 1
## has_repairYes has_bill_of_saleYes has_mortgageYes floor total_floors
## 1 1 1 1 5 12
## 2 1 1 1 19 20
## 3 1 1 1 7 18
## 4 1 1 1 8 10
## 5 1 1 1 12 15
## 6 1 1 1 10 16
## regionResidential District regionSuburban Area log_price log_square
## 1 0 0 12.55673 4.948760
## 2 0 0 12.77988 4.912655
## 3 1 0 13.53447 5.351858
## 4 0 0 12.40902 4.465908
## 5 0 0 12.76569 5.164786
## 6 0 0 12.44902 4.543295
## interaction_log_square_floor interaction_log_square_new_building
## 1 24.74380 9.897520
## 2 93.34044 9.825310
## 3 37.46301 10.703716
## 4 35.72726 8.931816
## 5 61.97743 10.329572
## 6 45.43295 9.086590
## interaction_floor_has_mortgage interaction_log_square_rooms
## 1 10 14.84628
## 2 38 14.73796
## 3 14 21.40743
## 4 16 13.39772
## 5 24 20.65914
## 6 20 9.08659
## interaction_floor_rooms log_square_squared floor_cubed
## 1 15 24.49022 125
## 2 57 24.13418 6859
## 3 28 28.64239 343
## 4 24 19.94434 512
## 5 48 26.67501 1728
## 6 20 20.64153 1000
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
# Fit a linear model
model <- lm(log_price ~ log_square + floor + floor_cubed + log_square_squared + interaction_log_square_floor + interaction_log_square_new_building + interaction_floor_has_mortgage + interaction_log_square_rooms + interaction_floor_rooms, data = baku_apartment_data_logged)
# Calculate VIF
vif_values <- vif(model)
print(vif_values)
## log_square floor
## 218.801164 216.789746
## floor_cubed log_square_squared
## 5.153233 270.258685
## interaction_log_square_floor interaction_log_square_new_building
## 393.153458 2.932818
## interaction_floor_has_mortgage interaction_log_square_rooms
## 3.300673 17.281943
## interaction_floor_rooms
## 43.385990
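For a model whose terms each have one degree of freedom, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on synthetic data (`x1`, `x2`, `x3` are made up, with `x2` built to be nearly collinear with `x1`); it is meant to show where large values such as those above come from, not to replace `car::vif()`.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(n)                 # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors
vif_by_hand <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})
round(vif_by_hand, 2)
```

Derived features such as squares and products of the same base variables behave like `x2` here, which is why they dominate the VIF table.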
library(dplyr)
# Updating the data_encoded dataset
data_encoded <- data_encoded %>%
select(
-log_square_squared, # Remove log_square_squared
-interaction_log_square_floor # Remove interaction_log_square_floor
)
# Verify the changes and structure of the updated dataset
str(data_encoded)
## 'data.frame': 39302 obs. of 27 variables:
## $ location28 May m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ location8 Noyabr m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationElmlər Akademiyası m. : num 0 0 0 1 1 0 0 0 0 0 ...
## $ locationHəzi Aslanov m. : num 0 0 0 0 0 0 0 0 0 1 ...
## $ locationİnşaatçılar m. : num 0 0 0 0 0 0 0 1 0 0 ...
## $ locationMemar Əcəmi m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəriman Nərimanov m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNərimanov r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəsimi r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationOthers : num 1 0 1 0 0 1 1 0 1 0 ...
## $ locationŞah İsmayıl Xətai m. : num 0 1 0 0 0 0 0 0 0 0 ...
## $ rooms : num 3 3 4 3 4 2 3 3 2 3 ...
## $ new_buildingYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_repairYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_bill_of_saleYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_mortgageYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ floor : num 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : num 12 20 18 10 15 16 6 18 16 17 ...
## $ regionResidential District : num 0 0 1 0 0 0 0 0 0 0 ...
## $ regionSuburban Area : num 0 0 0 0 0 0 1 0 0 0 ...
## $ log_price : num 12.6 12.8 13.5 12.4 12.8 ...
## $ log_square : num 4.95 4.91 5.35 4.47 5.16 ...
## $ interaction_log_square_new_building: num 9.9 9.83 10.7 8.93 10.33 ...
## $ interaction_floor_has_mortgage : num 10 38 14 16 24 20 8 18 30 20 ...
## $ interaction_log_square_rooms : num 14.8 14.7 21.4 13.4 20.7 ...
## $ interaction_floor_rooms : num 15 57 28 24 48 20 12 27 30 30 ...
## $ floor_cubed : num 125 6859 343 512 1728 ...
# Fit a linear model with selected variables
model <- lm(log_price ~ rooms + floor + total_floors + log_square +
interaction_log_square_new_building + interaction_floor_has_mortgage +
interaction_log_square_rooms + interaction_floor_rooms + floor_cubed,
data = data_encoded)
# Load necessary library
library(car)
# Calculate and print VIF
vif_values <- vif(model)
print(vif_values)
## rooms floor
## 48.207924 15.998556
## total_floors log_square
## 3.041088 8.532022
## interaction_log_square_new_building interaction_floor_has_mortgage
## 4.114478 3.302992
## interaction_log_square_rooms interaction_floor_rooms
## 77.606066 15.932090
## floor_cubed
## 4.871472
# Fit a new linear model without 'rooms', its two interaction terms, and 'floor_cubed'
updated_model <- lm(log_price ~ floor + total_floors + log_square +
interaction_log_square_new_building + interaction_floor_has_mortgage,
data = data_encoded)
# If needed, ensure other parts of your analysis that used 'rooms' are adjusted or reevaluated.
# Load necessary library
library(car)
# Calculate and print new VIF values
new_vif_values <- vif(updated_model)
print(new_vif_values)
## floor total_floors
## 3.898824 2.969334
## log_square interaction_log_square_new_building
## 2.053790 3.712131
## interaction_floor_has_mortgage
## 3.297039
library(dplyr)
# Update data_encoded by removing specified variables
data_encoded <- data_encoded %>%
select(
-floor_cubed,
-interaction_floor_rooms,
-interaction_log_square_rooms,
-rooms
)
# Verify the structure of the updated dataset
str(data_encoded)
## 'data.frame': 39302 obs. of 23 variables:
## $ location28 May m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ location8 Noyabr m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationElmlər Akademiyası m. : num 0 0 0 1 1 0 0 0 0 0 ...
## $ locationHəzi Aslanov m. : num 0 0 0 0 0 0 0 0 0 1 ...
## $ locationİnşaatçılar m. : num 0 0 0 0 0 0 0 1 0 0 ...
## $ locationMemar Əcəmi m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəriman Nərimanov m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNərimanov r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəsimi r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationOthers : num 1 0 1 0 0 1 1 0 1 0 ...
## $ locationŞah İsmayıl Xətai m. : num 0 1 0 0 0 0 0 0 0 0 ...
## $ new_buildingYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_repairYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_bill_of_saleYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_mortgageYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ floor : num 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : num 12 20 18 10 15 16 6 18 16 17 ...
## $ regionResidential District : num 0 0 1 0 0 0 0 0 0 0 ...
## $ regionSuburban Area : num 0 0 0 0 0 0 1 0 0 0 ...
## $ log_price : num 12.6 12.8 13.5 12.4 12.8 ...
## $ log_square : num 4.95 4.91 5.35 4.47 5.16 ...
## $ interaction_log_square_new_building: num 9.9 9.83 10.7 8.93 10.33 ...
## $ interaction_floor_has_mortgage : num 10 38 14 16 24 20 8 18 30 20 ...
nrow(data_encoded)
## [1] 39302
ncol(data_encoded)
## [1] 23
library(dplyr)
# Drop 'locationOthers' so that "Others" becomes the reference category;
# keeping all 11 location dummies alongside the intercept would be perfectly collinear
data_encoded <- data_encoded %>%
select(
-locationOthers
)
# Verify the structure of the updated dataset
str(data_encoded)
## 'data.frame': 39302 obs. of 22 variables:
## $ location28 May m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ location8 Noyabr m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationElmlər Akademiyası m. : num 0 0 0 1 1 0 0 0 0 0 ...
## $ locationHəzi Aslanov m. : num 0 0 0 0 0 0 0 0 0 1 ...
## $ locationİnşaatçılar m. : num 0 0 0 0 0 0 0 1 0 0 ...
## $ locationMemar Əcəmi m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəriman Nərimanov m. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNərimanov r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationNəsimi r. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ locationŞah İsmayıl Xətai m. : num 0 1 0 0 0 0 0 0 0 0 ...
## $ new_buildingYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_repairYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_bill_of_saleYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ has_mortgageYes : num 1 1 1 1 1 1 1 1 1 1 ...
## $ floor : num 5 19 7 8 12 10 4 9 15 10 ...
## $ total_floors : num 12 20 18 10 15 16 6 18 16 17 ...
## $ regionResidential District : num 0 0 1 0 0 0 0 0 0 0 ...
## $ regionSuburban Area : num 0 0 0 0 0 0 1 0 0 0 ...
## $ log_price : num 12.6 12.8 13.5 12.4 12.8 ...
## $ log_square : num 4.95 4.91 5.35 4.47 5.16 ...
## $ interaction_log_square_new_building: num 9.9 9.83 10.7 8.93 10.33 ...
## $ interaction_floor_has_mortgage : num 10 38 14 16 24 20 8 18 30 20 ...
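Dropping one location dummy avoids the so-called dummy-variable trap. The toy example below, on synthetic data, sketches what `lm()` does when a full set of dummies is kept next to an intercept, and shows that dropping one column changes the reference category but not the fit.

```r
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30)

# Full set of dummies (gA, gB, gC), as produced by a no-intercept model matrix
all_dummies <- as.data.frame(model.matrix(~ g - 1))
all_dummies$y <- y

fit_full <- lm(y ~ ., data = all_dummies)       # intercept + all three dummies
fit_ref <- lm(y ~ . - gA, data = all_dummies)   # "A" becomes the reference level

coef(fit_full)  # one dummy is reported as NA (aliased)
coef(fit_ref)   # all coefficients estimable
```

Choosing which column to drop by hand, rather than letting `lm()` alias one, makes the reference category explicit when reading the coefficients.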
Summary
# Fit the linear regression model using all predictors in data_encoded
model <- lm(log_price ~ ., data = data_encoded)
# Display the summary of the model to see the coefficients and statistics
summary(model)
##
## Call:
## lm(formula = log_price ~ ., data = data_encoded)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72106 -0.13880 -0.00728 0.12868 2.49264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.2803945 0.0302044 274.145 < 2e-16 ***
## `location28 May m.` 0.1467243 0.0065236 22.491 < 2e-16 ***
## `location8 Noyabr m.` 0.0465046 0.0070985 6.551 5.77e-11 ***
## `locationElmlər Akademiyası m.` 0.1449670 0.0065312 22.196 < 2e-16 ***
## `locationHəzi Aslanov m.` -0.1870247 0.0059136 -31.626 < 2e-16 ***
## `locationİnşaatçılar m.` -0.1026770 0.0052660 -19.498 < 2e-16 ***
## `locationMemar Əcəmi m.` -0.1056256 0.0061108 -17.285 < 2e-16 ***
## `locationNəriman Nərimanov m.` 0.1124459 0.0054052 20.803 < 2e-16 ***
## `locationNərimanov r.` 0.2062878 0.0078985 26.117 < 2e-16 ***
## `locationNəsimi r.` 0.2267927 0.0071857 31.562 < 2e-16 ***
## `locationŞah İsmayıl Xətai m.` 0.1092030 0.0059098 18.478 < 2e-16 ***
## new_buildingYes -0.6201738 0.0342797 -18.092 < 2e-16 ***
## has_repairYes 0.1320820 0.0037267 35.442 < 2e-16 ***
## has_bill_of_saleYes 0.0046467 0.0035555 1.307 0.19125
## has_mortgageYes -0.0429767 0.0053495 -8.034 9.71e-16 ***
## floor -0.0066657 0.0008028 -8.303 < 2e-16 ***
## total_floors 0.0072447 0.0004085 17.734 < 2e-16 ***
## `regionResidential District` -0.1147504 0.0054709 -20.975 < 2e-16 ***
## `regionSuburban Area` -0.2177366 0.0040446 -53.834 < 2e-16 ***
## log_square 0.6752652 0.0145670 46.356 < 2e-16 ***
## interaction_log_square_new_building 0.1436735 0.0078914 18.206 < 2e-16 ***
## interaction_floor_has_mortgage 0.0016222 0.0005329 3.044 0.00234 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2459 on 39280 degrees of freedom
## Multiple R-squared: 0.7954, Adjusted R-squared: 0.7953
## F-statistic: 7271 on 21 and 39280 DF, p-value: < 2.2e-16
# Fit the reduced model without 'has_bill_of_saleYes'
reduced_model <- lm(log_price ~ . - has_bill_of_saleYes, data = data_encoded)
# Perform ANOVA comparison
model_comparison <- anova(model, reduced_model)
print(model_comparison)
## Analysis of Variance Table
##
## Model 1: log_price ~ `location28 May m.` + `location8 Noyabr m.` + `locationElmlər Akademiyası m.` +
## `locationHəzi Aslanov m.` + `locationİnşaatçılar m.` +
## `locationMemar Əcəmi m.` + `locationNəriman Nərimanov m.` +
## `locationNərimanov r.` + `locationNəsimi r.` + `locationŞah İsmayıl Xətai m.` +
## new_buildingYes + has_repairYes + has_bill_of_saleYes + has_mortgageYes +
## floor + total_floors + `regionResidential District` + `regionSuburban Area` +
## log_square + interaction_log_square_new_building + interaction_floor_has_mortgage
## Model 2: log_price ~ (`location28 May m.` + `location8 Noyabr m.` + `locationElmlər Akademiyası m.` +
## `locationHəzi Aslanov m.` + `locationİnşaatçılar m.` +
## `locationMemar Əcəmi m.` + `locationNəriman Nərimanov m.` +
## `locationNərimanov r.` + `locationNəsimi r.` + `locationŞah İsmayıl Xətai m.` +
## new_buildingYes + has_repairYes + has_bill_of_saleYes + has_mortgageYes +
## floor + total_floors + `regionResidential District` + `regionSuburban Area` +
## log_square + interaction_log_square_new_building + interaction_floor_has_mortgage) -
## has_bill_of_saleYes
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39280 2375.6
## 2 39281 2375.7 -1 -0.1033 1.708 0.1913
The p-value of 0.1913 from the ANOVA test is greater than the common significance level of 0.05, suggesting that the difference in the fit between the original model and the reduced model (without has_bill_of_saleYes) is not statistically significant. This means that removing has_bill_of_saleYes does not significantly worsen the model’s fit.
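When a single coefficient is dropped, the ANOVA F statistic equals the square of that coefficient's t value in the full model, so the two outputs above can be cross-checked by hand. The numbers below are copied from the printed tables; the variable names are introduced here for illustration.

```r
# Values copied from the model summary and ANOVA table above
t_value <- 1.307     # t value of has_bill_of_saleYes in the full model
rss_full <- 2375.6   # residual sum of squares, full model
df_full <- 39280     # residual degrees of freedom, full model
ss_diff <- 0.1033    # increase in RSS after dropping the variable

f_from_t <- t_value^2
f_from_rss <- (ss_diff / 1) / (rss_full / df_full)
p_value <- pf(f_from_rss, df1 = 1, df2 = df_full, lower.tail = FALSE)

c(f_from_t = f_from_t, f_from_rss = f_from_rss, p_value = p_value)
```

Both routes give an F value of about 1.708, up to rounding of the printed inputs.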
THE FINAL MODEL:
# Rename the reduced model to final_model
final_model <- reduced_model
# Display the summary of the final model
summary(final_model)
##
## Call:
## lm(formula = log_price ~ . - has_bill_of_saleYes, data = data_encoded)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72382 -0.13860 -0.00727 0.12856 2.49202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.2842176 0.0300627 275.564 < 2e-16 ***
## `location28 May m.` 0.1467025 0.0065236 22.488 < 2e-16 ***
## `location8 Noyabr m.` 0.0468381 0.0070940 6.602 4.09e-11 ***
## `locationElmlər Akademiyası m.` 0.1452867 0.0065267 22.260 < 2e-16 ***
## `locationHəzi Aslanov m.` -0.1869151 0.0059131 -31.610 < 2e-16 ***
## `locationİnşaatçılar m.` -0.1024154 0.0052622 -19.462 < 2e-16 ***
## `locationMemar Əcəmi m.` -0.1056365 0.0061109 -17.287 < 2e-16 ***
## `locationNəriman Nərimanov m.` 0.1121590 0.0054008 20.767 < 2e-16 ***
## `locationNərimanov r.` 0.2062880 0.0078985 26.117 < 2e-16 ***
## `locationNəsimi r.` 0.2267076 0.0071855 31.551 < 2e-16 ***
## `locationŞah İsmayıl Xətai m.` 0.1091699 0.0059098 18.473 < 2e-16 ***
## new_buildingYes -0.6265199 0.0339343 -18.463 < 2e-16 ***
## has_repairYes 0.1334520 0.0035763 37.316 < 2e-16 ***
## has_mortgageYes -0.0418576 0.0052805 -7.927 2.31e-15 ***
## floor -0.0066987 0.0008024 -8.348 < 2e-16 ***
## total_floors 0.0072524 0.0004085 17.754 < 2e-16 ***
## `regionResidential District` -0.1147580 0.0054709 -20.976 < 2e-16 ***
## `regionSuburban Area` -0.2177396 0.0040446 -53.834 < 2e-16 ***
## log_square 0.6739214 0.0145308 46.379 < 2e-16 ***
## interaction_log_square_new_building 0.1447786 0.0078461 18.452 < 2e-16 ***
## interaction_floor_has_mortgage 0.0016530 0.0005324 3.105 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2459 on 39281 degrees of freedom
## Multiple R-squared: 0.7954, Adjusted R-squared: 0.7953
## F-statistic: 7634 on 20 and 39281 DF, p-value: < 2.2e-16
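Because the response is `log_price`, a dummy coefficient β corresponds to a price change of roughly 100 * (exp(β) - 1) percent, holding the other regressors fixed. The sketch below applies this to coefficients copied from the summary above. It assumes the interaction terms were built from factor level codes (1/2), in which case the new-building contrast at a given area is the dummy coefficient plus the interaction coefficient times `log(square + 1)`, and the mortgage contrast is built the same way with `floor`. The helper names are hypothetical, and these are descriptive readings of the fitted model, not causal estimates.

```r
# Coefficients copied from the final model summary above
b <- c(has_repair = 0.1334520, suburban = -0.2177396,
       new_building = -0.6265199, nb_x_log_square = 0.1447786,
       has_mortgage = -0.0418576, mortgage_x_floor = 0.0016530)

# Percent change in price implied by a change of `beta` in log price
pct_change <- function(beta) 100 * (exp(beta) - 1)

# Log-price contrast new vs. old building at a given area (assumes 1/2 coding)
new_building_effect <- function(square) {
  unname(b["new_building"] + b["nb_x_log_square"] * log(square + 1))
}

# Log-price contrast mortgage vs. no mortgage at a given floor (assumes 1/2 coding)
mortgage_effect <- function(floor) {
  unname(b["has_mortgage"] + b["mortgage_x_floor"] * floor)
}

pct_change(b["has_repair"])
pct_change(b["suburban"])
pct_change(new_building_effect(c(50, 100, 200)))
pct_change(mortgage_effect(c(2, 10, 20)))
```

Read this way, a repaired apartment is priced roughly 14.3 percent higher and a suburban one roughly 19.6 percent lower than the respective baselines, while the new-building and mortgage contrasts depend on area and floor rather than being single numbers.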
Gauss-Markov 6 Golden Assumptions (BLUE)
Because the dataset is not time-series data and the observations are not ordered in time or otherwise temporally related, the assumption of no autocorrelation of residuals (independence) typically does not require explicit testing. The remaining five assumptions are listed below:
Linearity: Ensure the relationship between the predictors and the response is linear. This can be visually inspected through plots of the residuals vs. fitted values.
Normality of Errors: This can be assessed with a Q-Q plot of the residuals. If the residuals deviate significantly from the line in a Q-Q plot, it may indicate non-normality, which can affect the validity of confidence intervals and hypothesis tests.
Multicollinearity: Check with VIF scores, as done above. High VIFs suggest that linear dependencies among the explanatory variables could be inflating the variances of the estimated coefficients.
Homoscedasticity: Look for constant variance of residuals in the plot of residuals vs. fitted values. A funnel shape or any other systematic pattern indicates heteroscedasticity, which violates this assumption.
Endogeneity: While tricky to test directly without deeper analysis or external instruments, careful model specification and understanding the data generation process are crucial. Make sure all relevant variables are included to avoid omitted variable bias.
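The two visual checks mentioned in the list (residuals vs. fitted values, and a Q-Q plot of residuals) can be sketched in base R as follows. The helper `diagnostic_plots()` and the synthetic model `toy_fit` are hypothetical; the toy data are generated with error spread that grows with `x`, so the first panel should show a funnel shape.

```r
set.seed(7)
x <- runif(300, 1, 10)
y <- 2 + 0.5 * x + rnorm(300, sd = 0.2 * x)  # error spread grows with x
toy_fit <- lm(y ~ x)

# Hypothetical helper: draws the two diagnostic panels side by side
diagnostic_plots <- function(fit) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  plot(fitted(fit), residuals(fit),
       xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
  abline(h = 0, lty = 2)
  qqnorm(residuals(fit), main = "Normal Q-Q plot of residuals")
  qqline(residuals(fit))
  invisible(NULL)
}
```

Calling `diagnostic_plots(toy_fit)` draws the panels for the toy model; the same call with `final_model` would apply it to the model fitted above.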
# Ensure the 'lmtest' package is loaded for the RESET test
if (!require(lmtest)) {
install.packages("lmtest")
}
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lmtest)
# Run Ramsey's RESET test for model specification
reset_test <- resettest(final_model)
print(reset_test)
##
## RESET test
##
## data: final_model
## RESET = 57.067, df1 = 2, df2 = 39279, p-value < 2.2e-16
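The RESET test rejects the null hypothesis of a correctly specified functional form (p-value < 2.2e-16). In essence, it is warning that our model might not be capturing all the dynamics of the data, for example nonlinear effects, suggesting that further refinement is necessary to improve its accuracy and reliability.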
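To make the mechanics of the RESET test concrete, the sketch below reproduces its idea by hand on synthetic data: powers of the fitted values are added to the model and their joint significance is tested with an F-test. The data are made up, with a quadratic term deliberately left out of the base model, and the object names are illustrative.

```r
set.seed(1)
n <- 500
x <- runif(n, -3, 3)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n)  # true relation is quadratic

base_fit <- lm(y ~ x)                  # misspecified: omits x^2
yhat <- fitted(base_fit)

# Augment the model with squared and cubed fitted values, then compare
aug_fit <- lm(y ~ x + I(yhat^2) + I(yhat^3))
reset_by_hand <- anova(base_fit, aug_fit)
reset_p <- reset_by_hand$`Pr(>F)`[2]

print(reset_by_hand)
```

A small p-value says the added powers explain variation that the base model misses, which is the same signal the test gives for the apartment price model.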
# Load necessary package
if (!require(nortest)) {
install.packages("nortest")
}
## Loading required package: nortest
library(nortest)
# Perform Lilliefors (Kolmogorov-Smirnov) test for normality
lillie_test <- lillie.test(residuals(final_model))
print(lillie_test)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: residuals(final_model)
## D = 0.055862, p-value < 2.2e-16
The Lilliefors (Kolmogorov-Smirnov) test returns a p-value far below 0.05 (p-value < 2.2e-16), which strongly suggests that the residuals of our model do not follow a normal distribution. The test statistic D = 0.055862 is the maximum deviation between the observed cumulative distribution of residuals and the expected cumulative distribution under normality.
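The statistic D can be computed by hand, which clarifies what it measures. The sketch below uses synthetic heavy-tailed values in place of the model residuals (the vector `res` is made up) and compares the empirical distribution of the standardized values with the standard normal CDF.

```r
set.seed(3)
res <- rt(500, df = 5)  # synthetic stand-in for model residuals

# Standardize with the sample mean and standard deviation, then sort
z <- sort((res - mean(res)) / sd(res))
n <- length(z)
p <- pnorm(z)

# Largest gap between the empirical CDF and the normal CDF, from both sides
d_plus <- max(seq_len(n) / n - p)
d_minus <- max(p - (seq_len(n) - 1) / n)
D <- max(d_plus, d_minus)
D
```

With tens of thousands of observations even a small D yields a very small p-value, so it can help to read the statistic alongside a Q-Q plot to judge how large the departure from normality is in practice.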
# Select only numeric predictors
numeric_vars <- c("floor", "total_floors", "log_square",
"interaction_log_square_new_building",
"interaction_floor_has_mortgage")
# Fit a new linear model using only numeric predictors
numeric_model <- lm(log_price ~ ., data = data_encoded[, c(numeric_vars, "log_price")])
# Compute VIF for only numeric predictors
vif_numeric <- vif(numeric_model)
print(vif_numeric)
## floor total_floors
## 3.898824 2.969334
## log_square interaction_log_square_new_building
## 2.053790 3.712131
## interaction_floor_has_mortgage
## 3.297039
All VIF values are below 5, a common rule-of-thumb threshold, so we find no evidence of problematic multicollinearity among the independent variables and treat the model as stable.
IV regression Application
if (!require(AER)) install.packages("AER") # IV regression
## Loading required package: AER
## Loading required package: sandwich
## Loading required package: survival
library(AER)
if (!require(lmtest)) install.packages("lmtest") # For hypothesis testing
library(lmtest)
# First stage: regress suspected endogenous variable on instruments
first_stage <- lm(log_square ~ total_floors, data = data_encoded)
# Save residuals from first-stage regression
data_encoded$residuals_iv <- residuals(first_stage)
# Second stage: Include residuals from first-stage regression
hausman_model <- lm(log_price ~ log_square + floor + total_floors +
interaction_log_square_new_building + interaction_floor_has_mortgage +
residuals_iv, data = data_encoded)
# Perform t-test on residual term to check for endogeneity
summary(hausman_model)
##
## Call:
## lm(formula = log_price ~ log_square + floor + total_floors +
## interaction_log_square_new_building + interaction_floor_has_mortgage +
## residuals_iv, data = data_encoded)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79238 -0.15512 -0.00093 0.16533 2.52511
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6313932 0.0156546 487.487 < 2e-16 ***
## log_square 0.9885958 0.0043166 229.023 < 2e-16 ***
## floor -0.0017307 0.0005576 -3.104 0.00191 **
## total_floors 0.0115576 0.0004463 25.899 < 2e-16 ***
## interaction_log_square_new_building -0.0083612 0.0011527 -7.254 4.13e-13 ***
## interaction_floor_has_mortgage -0.0023762 0.0003027 -7.849 4.30e-15 ***
## residuals_iv NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2778 on 39296 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7388
## F-statistic: 2.223e+04 on 5 and 39296 DF, p-value: < 2.2e-16
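The coefficient on the first-stage residual (residuals_iv) is reported as NA because it is perfectly collinear with the other predictors. The first stage regresses log_square on total_floors alone, and both log_square and total_floors also enter the second stage, so residuals_iv is an exact linear combination of regressors that are already in the model. The NA is therefore a mechanical consequence of the specification, not evidence that log_square is exogenous.
Final Summary
1. Since residuals_iv is not estimable (NA), the regression-based Hausman test cannot be carried out with this specification. A valid test needs at least one instrument that is excluded from the price equation, and total_floors does not qualify because it is itself a predictor of log_price.
2. The endogeneity of log_square therefore remains untested rather than rejected.
Decision: In the absence of a suitable excluded instrument, we continue with the OLS model, noting that exogeneity of log_square is an assumption rather than a tested result.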
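The NA on residuals_iv can be reproduced on simulated data. The sketch below contrasts a first stage that uses only a regressor already in the outcome equation with one that adds an excluded instrument. The instrument z and all other variables are hypothetical; the project data may not contain a valid instrument for log_square.

```r
set.seed(123)
n <- 5000

w <- rnorm(n)  # exogenous control that also enters the outcome equation
z <- rnorm(n)  # hypothetical excluded instrument (assumption)
u <- rnorm(n)  # structural error

# x is endogenous by construction because it depends on u
x <- 0.5 * w + 0.8 * z + 0.6 * u + rnorm(n)
y <- 1 + 2 * x + 1 * w + u

# (a) First stage uses only w, which is also in the second stage:
#     the residual is an exact linear combination of x, w and the intercept
v_bad <- residuals(lm(x ~ w))
fit_bad <- lm(y ~ x + w + v_bad)
print(coef(fit_bad))  # coefficient on v_bad is NA

# (b) First stage also uses the excluded instrument z:
#     the residual is now estimable, and its t-test is the endogeneity test
v_ok <- residuals(lm(x ~ w + z))
fit_ok <- lm(y ~ x + w + v_ok)
print(coef(summary(fit_ok))["v_ok", ])
```

In case (b) the residual term is significant, which correctly signals the endogeneity built into the simulation. In case (a) no conclusion is possible either way.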
# Install and load necessary package
if (!require(lmtest)) install.packages("lmtest")
library(lmtest)
# Perform Breusch-Pagan test
bp_test <- bptest(final_model)
# Print test result
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: final_model
## BP = 3512.6, df = 20, p-value < 2.2e-16
Decision rule: if the p-value is above 0.05, we fail to reject H₀ (homoscedasticity); if it is below 0.05, we reject H₀ and conclude that heteroscedasticity is present.
Since the p-value is extremely small (< 2.2e-16), we reject the null hypothesis (H₀) of homoscedasticity. Heteroscedasticity is present in the model, meaning the variance of the residuals is not constant. The OLS coefficient estimates remain unbiased, but the conventional standard errors are no longer valid, so confidence intervals and p-values based on them may be unreliable.
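For intuition, the studentized Breusch-Pagan statistic can be computed by hand as n times the R² of an auxiliary regression of the squared residuals on the regressors. The sketch below does this in base R on simulated data; the data and names are assumptions, not the project model.

```r
set.seed(7)
n <- 2000

x1 <- runif(n, 1, 10)
x2 <- rnorm(n)

# Error standard deviation grows with x1, so the errors are
# heteroscedastic by construction
y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.2 * x1)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression of squared residuals on the same regressors
aux <- lm(residuals(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) BP statistic
bp_df   <- length(coef(aux)) - 1        # regressors, excluding the intercept
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_pval))
```

The degrees of freedom equal the number of regressors, which is why the test on final_model reports df = 20.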
Robust Standard Error Application
# Install and load necessary package
if (!require(sandwich)) install.packages("sandwich")
if (!require(lmtest)) install.packages("lmtest")
library(sandwich)
library(lmtest)
# Compute robust standard errors (HC1) and re-run hypothesis tests
robust_se <- coeftest(final_model, vcov = vcovHC(final_model, type = "HC1"))
# Print results with robust standard errors
print(robust_se)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28421758 0.04104606 201.8273 < 2.2e-16
## `location28 May m.` 0.14670248 0.00598256 24.5217 < 2.2e-16
## `location8 Noyabr m.` 0.04683806 0.00545118 8.5923 < 2.2e-16
## `locationElmlər Akademiyası m.` 0.14528668 0.00596229 24.3676 < 2.2e-16
## `locationHəzi Aslanov m.` -0.18691513 0.00411735 -45.3969 < 2.2e-16
## `locationİnşaatçılar m.` -0.10241537 0.00448243 -22.8482 < 2.2e-16
## `locationMemar Əcəmi m.` -0.10563652 0.00409048 -25.8249 < 2.2e-16
## `locationNəriman Nərimanov m.` 0.11215903 0.00456917 24.5469 < 2.2e-16
## `locationNərimanov r.` 0.20628803 0.00766612 26.9091 < 2.2e-16
## `locationNəsimi r.` 0.22670757 0.00762101 29.7477 < 2.2e-16
## `locationŞah İsmayıl Xətai m.` 0.10916992 0.00559034 19.5283 < 2.2e-16
## new_buildingYes -0.62651987 0.04676083 -13.3984 < 2.2e-16
## has_repairYes 0.13345199 0.00422712 31.5704 < 2.2e-16
## has_mortgageYes -0.04185764 0.00544506 -7.6873 1.538e-14
## floor -0.00669868 0.00082175 -8.1517 3.694e-16
## total_floors 0.00725242 0.00055128 13.1556 < 2.2e-16
## `regionResidential District` -0.11475799 0.00637638 -17.9974 < 2.2e-16
## `regionSuburban Area` -0.21773958 0.00478879 -45.4686 < 2.2e-16
## log_square 0.67392139 0.02055820 32.7811 < 2.2e-16
## interaction_log_square_new_building 0.14477857 0.01094386 13.2292 < 2.2e-16
## interaction_floor_has_mortgage 0.00165295 0.00052434 3.1524 0.00162
##
## (Intercept) ***
## `location28 May m.` ***
## `location8 Noyabr m.` ***
## `locationElmlər Akademiyası m.` ***
## `locationHəzi Aslanov m.` ***
## `locationİnşaatçılar m.` ***
## `locationMemar Əcəmi m.` ***
## `locationNəriman Nərimanov m.` ***
## `locationNərimanov r.` ***
## `locationNəsimi r.` ***
## `locationŞah İsmayıl Xətai m.` ***
## new_buildingYes ***
## has_repairYes ***
## has_mortgageYes ***
## floor ***
## total_floors ***
## `regionResidential District` ***
## `regionSuburban Area` ***
## log_square ***
## interaction_log_square_new_building ***
## interaction_floor_has_mortgage **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
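Robust (HC1) standard errors address the issue without changing the model. The coefficient estimates are identical to those of final_model, while the standard errors, t-statistics, and p-values remain valid in the presence of heteroscedasticity. All predictors remain statistically significant at the 1% level.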
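The HC1 estimator is the White sandwich covariance scaled by n / (n - k). A minimal base R sketch on simulated data shows the calculation. The data and object names are assumptions, and in practice vcovHC() from the sandwich package does this work.

```r
set.seed(11)
n <- 1500

x <- runif(n, 1, 10)
y <- 2 + 0.4 * x + rnorm(n, sd = 0.3 * x)  # heteroscedastic by construction

fit <- lm(y ~ x)

X <- model.matrix(fit)  # design matrix including the intercept column
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))  # (X'X)^{-1}
meat  <- crossprod(X * e)     # X' diag(e^2) X, each row of X scaled by its residual

# HC1: sandwich estimator with a degrees-of-freedom correction
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc1))

print(cbind(estimate = coef(fit), se_classical, se_robust))
```

Only the standard errors differ between the two columns; the coefficient estimates themselves are untouched.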
In what follows, we use the original final_model for predictions and coefficient estimates, and the robust standard errors (HC1) when interpreting p-values and confidence intervals.
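This study aimed to examine the key factors influencing apartment prices in Baku, focusing on location, infrastructure, building characteristics, mortgage availability, and floor level. Because the dependent variable is log_price, the coefficient on log_square is an elasticity (the percentage change in price for a 1% change in area). The coefficients on floor and total_floors are semi-elasticities (approximately 100 × coefficient percent per additional floor), and the coefficients on categorical variables are approximate percentage differences relative to the reference category. The percentages quoted below are these log-point approximations. The exact figure is 100 × (exp(coefficient) - 1), which differs noticeably for larger coefficients.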
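As a short illustration of the gap between the log-point approximation and the exact percentage effect, the sketch below converts three coefficients copied from the robust coefficient table above. The vector names are shortened ASCII labels chosen for this example.

```r
# Coefficients copied from the robust coefficient table
# (labels are shortened for illustration)
b <- c(has_repair      =  0.13345199,   # has_repairYes
       region_suburban = -0.21773958,   # regionSuburban Area
       location_nasimi =  0.22670757)   # locationNəsimi r.

pct_approx <- 100 * b              # log-point approximation
pct_exact  <- 100 * (exp(b) - 1)   # exact percentage difference

print(round(cbind(pct_approx, pct_exact), 1))
```

The two columns agree closely for small coefficients and drift apart by a few percentage points once the coefficient exceeds about 0.2 in absolute value.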
1. Effect of Proximity to Key Amenities (Metro Stations and Infrastructure) The results show that location significantly affects apartment prices. Relative to the reference location, apartments near 28 May m. (+14.7%) and Elmlər Akademiyası m. (+14.5%) carry price premiums, as do apartments in Nərimanov r. (+20.6%), which, judging by the "r." suffix, is a district rather than a metro station. Conversely, Həzi Aslanov m. (-18.7%) and İnşaatçılar m. (-10.2%) show price discounts, suggesting that some metro areas are less desirable, potentially due to congestion or lower surrounding infrastructure quality. Because every location category is a metro station or a district, these coefficients measure price differences between neighborhoods rather than the effect of distance to a metro station as such.
2. Floor Level and Apartment Prices in High-Rise Buildings The coefficient for floor (-0.0067) suggests that, holding all else constant, each additional floor lowers the apartment price by about 0.67% for apartments without a mortgage option. For apartments with a mortgage option, the floor × mortgage interaction reduces this to about 0.50% per floor. This contradicts the assumption that higher floors are always more desirable, indicating that in Baku's market, buyers might prefer lower floors due to elevator accessibility, fire safety concerns, or cultural preferences. However, the total number of floors in the building has a positive effect (about +0.73% per floor), suggesting that taller buildings generally command higher prices, likely due to better amenities, security, or architectural quality.
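3. Price Differences Between New and Older Apartments The coefficient on new_buildingYes (-0.627) cannot be read as a 62.7% discount on its own, because the model also contains the interaction between new_building and log_square (+0.145). The main effect is the new-building difference at log_square = 0, which lies far outside the data. The net difference in log price is -0.627 + 0.145 × log_square, which is negative for small apartments and turns positive at an area of about 76 (in the units of the square variable, presumably square meters, and assuming natural logs). For example, it is about -6% at an area of 50 and about +4% at an area of 100. After controlling for location and the other factors, new buildings therefore sell at a modest discount for smaller apartments and at a modest premium for larger ones.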
4. Impact of Mortgage Availability on Property Prices The has_mortgageYes coefficient (-0.042) is the mortgage difference at floor 0, since the model also includes the floor × mortgage interaction (about +0.0017 per floor). The net difference in log price is -0.042 + 0.0017 × floor: about -4% on the lowest floors, shrinking toward zero with height, and turning positive only above roughly the 25th floor. Mortgage-eligible apartments are therefore slightly cheaper for all but the highest floors. This could be due to sellers being more flexible with pricing when buyers rely on financing, or to mortgage-eligible properties being concentrated in areas with lower baseline prices.
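Because of the interaction terms, the new-building and mortgage effects in points 3 and 4 vary with apartment size and floor. The sketch below evaluates both from the robust coefficient table. It assumes that log_square is the natural log of area and that each interaction variable is the product of the dummy and the continuous variable, as the variable names suggest. The chosen areas and floors are illustrative.

```r
# Coefficients copied from the robust coefficient table
b_new      <- -0.62651987  # new_buildingYes
b_new_size <-  0.14477857  # interaction_log_square_new_building
b_mort     <- -0.04185764  # has_mortgageYes
b_mort_flr <-  0.00165295  # interaction_floor_has_mortgage
b_floor    <- -0.00669868  # floor

# New-building effect as a function of area (assumption: natural log of area)
area <- c(40, 60, 80, 100, 150)
new_effect <- b_new + b_new_size * log(area)
new_pct <- 100 * (exp(new_effect) - 1)
breakeven_area <- exp(-b_new / b_new_size)

# Mortgage effect as a function of floor
floor_level <- c(1, 5, 10, 15, 20)
mort_effect <- b_mort + b_mort_flr * floor_level
mort_pct <- 100 * (exp(mort_effect) - 1)
breakeven_floor <- -b_mort / b_mort_flr

# Per-floor slope for apartments with a mortgage option
floor_slope_mortgage <- b_floor + b_mort_flr

print(data.frame(area, new_pct = round(new_pct, 1)))
print(data.frame(floor_level, mort_pct = round(mort_pct, 1)))
print(c(breakeven_area = breakeven_area,
        breakeven_floor = breakeven_floor,
        floor_slope_mortgage = floor_slope_mortgage))
```

Reporting effects at a few representative values, rather than the main-effect coefficient alone, avoids reading an out-of-sample intercept shift as the typical effect.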
The study confirms that location, infrastructure, and building characteristics are major determinants of apartment prices in Baku. Several central locations and metro areas carry sizeable premiums, while higher floors are associated with slightly lower prices, and new constructions do not command an unconditional premium once size and location are controlled for. Mortgage availability does not appear to push prices upward, suggesting that other economic or market constraints may play a role. These findings are associations estimated by OLS with heteroscedasticity-robust standard errors, and they provide useful input for urban planners, real estate developers, and policymakers in shaping housing policies and infrastructure planning.