The real estate industry frequently struggles with inaccurate
property valuation, driven by rapidly changing market
conditions and heterogeneous property characteristics.
This report demonstrates how Data Science and Machine
Learning can be applied to:
We use a real estate dataset containing structural attributes (area, number of bedrooms, bathrooms, stories) and qualitative features (road access, furnishing status, etc.) to build and evaluate predictive models.
library(tidyverse)
library(caret)
library(ggplot2)
library(randomForest)
library(Metrics)
library(corrplot)
# Load dataset
df <- read.csv("/Users/oyzx/Downloads/Housing_Price_Data.csv")
# Preview column names
names(df)
## [1] "price" "area" "bedrooms" "bathrooms"
## [5] "stories" "mainroad" "guestroom" "basement"
## [9] "hotwaterheating" "airconditioning" "parking" "prefarea"
## [13] "furnishingstatus"
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
We observe key variables such as:
price: selling price of the house
area: built-up area
bedrooms, bathrooms, stories
mainroad, guestroom, basement, hotwaterheating, airconditioning, parking, prefarea, furnishingstatus
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : int 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "no" ...
## $ basement : chr "no" "no" "yes" "yes" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "no" "yes" ...
## $ parking : int 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : chr "yes" "no" "yes" "yes" ...
## $ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 Median :3.000 Median :1.000
## Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000
## stories mainroad guestroom basement
## Min. :1.000 Length:545 Length:545 Length:545
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.806
## 3rd Qu.:2.000
## Max. :4.000
## hotwaterheating airconditioning parking prefarea
## Length:545 Length:545 Min. :0.0000 Length:545
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.6936
## 3rd Qu.:1.0000
## Max. :3.0000
## furnishingstatus
## Length:545
## Class :character
## Mode :character
##
##
##
The dataset mixes numerical and categorical variables, a combination that tree-based models such as Random Forest handle well.
We perform basic cleaning to ensure data quality:
# Remove missing values
df <- df %>% drop_na()
# Keep only rows with a plausible area (example rule: area > 200 sq ft)
df <- df %>% filter(area > 200)
summary(df$area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1650 3600 4600 5151 6360 16200
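Before dropping rows, it can help to see how many values are actually missing in each column. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the housing data) and base R only; `complete.cases()` keeps the same rows that `drop_na()` keeps when called without column arguments.

```r
# Toy data frame with two missing values (illustrative only)
toy <- data.frame(
  price = c(100, 200, NA, 400),
  area  = c(1000, NA, 3000, 4000)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing values in any column
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

In this report the data frame has 545 observations both before and after cleaning, so neither rule removed any rows.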
We engineer additional variables and ensure that categorical fields are correctly encoded as factors.
# Create price per sqft
df <- df %>% mutate(price_per_sqft = price / area)
# Convert categorical variables to factors
df$mainroad <- as.factor(df$mainroad)
df$guestroom <- as.factor(df$guestroom)
df$basement <- as.factor(df$basement)
df$hotwaterheating <- as.factor(df$hotwaterheating)
df$airconditioning <- as.factor(df$airconditioning)
df$prefarea <- as.factor(df$prefarea)
df$furnishingstatus <- as.factor(df$furnishingstatus)
str(df)
## 'data.frame': 545 obs. of 14 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : int 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ guestroom : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 2 2 ...
## $ basement : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 1 ...
## $ hotwaterheating : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ airconditioning : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ parking : int 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 2 2 ...
## $ furnishingstatus: Factor w/ 3 levels "furnished","semi-furnished",..: 1 1 2 1 1 2 2 3 1 3 ...
## $ price_per_sqft : num 1792 1367 1230 1629 1538 ...
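To see what factor encoding means for a model, the following sketch expands a three-level factor into the dummy columns that a formula-based model such as `lm()` works with. The `toy` data frame is made up for illustration; only base R's `model.matrix()` is used, with R's default treatment contrasts assumed.

```r
# Toy factor with the same three levels as furnishingstatus (illustrative only)
toy <- data.frame(
  furnishingstatus = factor(c("furnished", "semi-furnished",
                              "unfurnished", "furnished"))
)

# The first level ("furnished") becomes the reference category;
# every other level gets its own 0/1 indicator column
dummies <- model.matrix(~ furnishingstatus, data = toy)
print(colnames(dummies))
print(levels(toy$furnishingstatus)[1])
```

With this encoding, a model reports coefficients for the semi-furnished and unfurnished levels relative to furnished, which has no coefficient of its own.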
ggplot(df, aes(price)) +
  geom_density(fill = "#9b59b6", alpha = 0.4) +
  labs(
    title = "Distribution of House Prices",
    x = "Price",
    y = "Density"
  ) +
  theme_minimal()
ggplot(df, aes(bedrooms, price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "#2980b9") +
  labs(
    title = "Relationship Between Bedrooms and Price",
    x = "Number of Bedrooms",
    y = "Price"
  ) +
  theme_minimal()
ggplot(df, aes(area, price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "#e74c3c") +
  labs(
    title = "Area vs Price",
    x = "Area",
    y = "Price"
  ) +
  theme_minimal()
To understand how numerical features are related to price, we compute a correlation matrix.
numeric_df <- df %>%
  select(price, area, bedrooms, bathrooms, stories, parking, price_per_sqft)
cor_mat <- cor(numeric_df)
corrplot(cor_mat, method = "color", addCoef.col = "black",
         tl.col = "black", number.cex = 0.7,
         main = "Correlation Heatmap of Numeric Features")
Insights:
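One way to read a correlation matrix is to pull out the row for the target and rank the other features by their correlation with it. The sketch below does this on simulated data, since the values of the real matrix are not reproduced here; the toy variables and coefficients are assumptions for illustration.

```r
set.seed(1)
# Simulated features and a price that depends on both (illustrative only)
area     <- runif(50, min = 1000, max = 5000)
bedrooms <- sample(1:5, size = 50, replace = TRUE)
price    <- 250 * area + 100000 * bedrooms + rnorm(50, sd = 1e5)

cor_toy <- cor(data.frame(price, area, bedrooms))

# Correlation of every other variable with price, strongest first
ranked <- sort(cor_toy["price", colnames(cor_toy) != "price"],
               decreasing = TRUE)
print(round(ranked, 2))
```

When applying the same ranking to the report's matrix, note that `price_per_sqft` is computed from `price` itself, so part of its correlation with price is mechanical.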
set.seed(123)
trainIndex <- createDataPartition(df$price, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
nrow(train); nrow(test)
## [1] 438
## [1] 107
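For a numeric outcome, `createDataPartition()` groups the values into percentile-based bins and samples within each bin, so that the training and test sets cover a similar range of prices. A rough base-R sketch of that idea on simulated prices follows; the variable names and the use of five bins are assumptions for illustration, and this is not caret's actual implementation.

```r
set.seed(123)
toy_price <- rlnorm(200, meanlog = 15, sdlog = 0.4)  # simulated prices

# Assign each observation to one of five quantile bins
breaks <- quantile(toy_price, probs = seq(0, 1, by = 0.2))
bins   <- cut(toy_price, breaks = breaks, include.lowest = TRUE)

# Sample about 80% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(toy_price), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(toy_price), train_idx)

# Both subsets should span a similar range of prices
print(summary(toy_price[train_idx]))
print(summary(toy_price[test_idx]))
```

Sampling within bins matters most for skewed targets such as house prices, where a purely random split could leave the few very expensive properties on one side.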
We compare a baseline Linear Regression model with a Random Forest model.
model_lm <- lm(
  price ~ area + bedrooms + bathrooms + stories +
    mainroad + guestroom + basement + hotwaterheating +
    airconditioning + parking + prefarea + furnishingstatus,
  data = train
)
summary(model_lm)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## mainroad + guestroom + basement + hotwaterheating + airconditioning +
## parking + prefarea + furnishingstatus, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2674777 -654183 -41657 482010 4681273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42053.57 287850.03 -0.146 0.883915
## area 252.21 26.46 9.533 < 2e-16 ***
## bedrooms 134851.80 78837.38 1.711 0.087904 .
## bathrooms 976660.21 113788.88 8.583 < 2e-16 ***
## stories 449036.43 69467.54 6.464 2.81e-10 ***
## mainroadyes 464880.56 152408.07 3.050 0.002430 **
## guestroomyes 236678.68 145487.05 1.627 0.104522
## basementyes 417954.12 123001.36 3.398 0.000743 ***
## hotwaterheatingyes 727162.92 230758.29 3.151 0.001741 **
## airconditioningyes 813120.17 119183.43 6.822 3.10e-11 ***
## parking 276313.45 64808.00 4.264 2.48e-05 ***
## prefareayes 577198.71 126844.76 4.550 7.00e-06 ***
## furnishingstatussemi-furnished -91844.86 127520.35 -0.720 0.471776
## furnishingstatusunfurnished -430123.60 139861.29 -3.075 0.002239 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1052000 on 424 degrees of freedom
## Multiple R-squared: 0.6804, Adjusted R-squared: 0.6706
## F-statistic: 69.43 on 13 and 424 DF, p-value: < 2.2e-16
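The adjusted R-squared in the summary can be reproduced from the other reported numbers, which is a quick sanity check on how the output fits together. The sketch below plugs the printed values (R-squared 0.6804, 438 training rows, 13 estimated slope coefficients) into the standard formula; the variable names are illustrative.

```r
r2 <- 0.6804   # Multiple R-squared from the summary
n  <- 438      # rows in the training set
p  <- 13       # estimated coefficients, excluding the intercept

# Residual degrees of freedom: n - p - 1
df_resid <- n - p - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(df_resid)
print(round(adj_r2, 4))
```

The results agree with the 424 residual degrees of freedom and the adjusted R-squared of 0.6706 in the summary output.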
set.seed(123)
model_rf <- randomForest(
  price ~ area + bedrooms + bathrooms + stories +
    mainroad + guestroom + basement + hotwaterheating +
    airconditioning + parking + prefarea + furnishingstatus,
  data = train,
  ntree = 300,
  mtry = 4,
  importance = TRUE
)
# Variable importance plot
varImpPlot(model_rf, main = "Random Forest Feature Importance")
# Predictions on test set
pred_lm <- predict(model_lm, test)
pred_rf <- predict(model_rf, test)
# Compute evaluation metrics
rmse_lm <- rmse(test$price, pred_lm)
mae_lm <- mae(test$price, pred_lm)
rmse_rf <- rmse(test$price, pred_rf)
mae_rf <- mae(test$price, pred_rf)
rmse_lm; mae_lm
## [1] 1138053
## [1] 834204
rmse_rf; mae_rf
## [1] 1129612
## [1] 820556.4
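RMSE and MAE are both averages of the prediction error, but RMSE squares the errors first and so penalises large misses more heavily. The sketch below writes both metrics in base R (the helper names `rmse_manual` and `mae_manual` are hypothetical, chosen to avoid clashing with the `Metrics` functions) and then computes the relative gap between the two models from the printed values, assuming the second pair of numbers belongs to the Random Forest.

```r
# Root mean squared error and mean absolute error, written out explicitly
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy check with absolute errors of 10, 10 and 30
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
print(rmse_manual(actual, predicted))
print(mae_manual(actual, predicted))

# Relative improvement of Random Forest over Linear Regression,
# using the metric values printed in the report
rmse_gain <- (1138053 - 1129612) / 1138053
mae_gain  <- (834204 - 820556.4) / 834204
print(round(100 * c(rmse = rmse_gain, mae = mae_gain), 2))
```

On these numbers the Random Forest improves RMSE by under 1% and MAE by under 2%, so the two models perform almost identically on this test split.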
cat('<div class="kpi-box">',
    '<div class="kpi-label">Linear Regression RMSE</div>',
    '<div class="kpi-value">', round(rmse_lm, 2), '</div>',
    '</div>')
cat('<div class="kpi-box">',
    '<div class="kpi-label">Random Forest RMSE</div>',
    '<div class="kpi-value">', round(rmse_rf, 2), '</div>',
    '</div>')
cat('<div class="kpi-box">',
    '<div class="kpi-label">Linear Regression MAE</div>',
    '<div class="kpi-value">', round(mae_lm, 2), '</div>',
    '</div>')
cat('<div class="kpi-box">',
    '<div class="kpi-label">Random Forest MAE</div>',
    '<div class="kpi-value">', round(mae_rf, 2), '</div>',
    '</div>')
comparison_df <- data.frame(
  actual = test$price,
  predicted = pred_rf
)
ggplot(comparison_df, aes(actual, predicted)) +
  geom_point(color = "#27ae60", alpha = 0.6) +
  geom_abline(color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Housing Prices (Random Forest)",
    x = "Actual Price",
    y = "Predicted Price"
  ) +
  theme_minimal()
Based on the analysis and modeling results:
Larger floor area (area) and more bedrooms/bathrooms are strongly associated with higher prices.
From a real estate industry perspective, these insights can be used to:
This study shows that:
Future extensions could include: