1 Introduction

The real estate industry frequently struggles with inaccurate property valuation, driven by rapidly changing market conditions and heterogeneous property characteristics.
This report demonstrates how Data Science and Machine Learning can be applied to:

model the relationship between housing characteristics and price, and
build a predictive model to support data-driven valuation decisions.

We use a real estate dataset containing structural attributes (area, number of bedrooms, bathrooms, stories) and qualitative features (road access, furnishing status, etc.) to build and evaluate predictive models.

2 Load Libraries and Dataset

library(tidyverse)
library(caret)
library(ggplot2)
library(randomForest)
library(Metrics)
library(corrplot)

# Load dataset
df <- read.csv("/Users/oyzx/Downloads/Housing_Price_Data.csv")

# Preview column names
names(df)

##  [1] "price"            "area"             "bedrooms"         "bathrooms"       
##  [5] "stories"          "mainroad"         "guestroom"        "basement"        
##  [9] "hotwaterheating"  "airconditioning"  "parking"          "prefarea"        
## [13] "furnishingstatus"

head(df)

##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking prefarea furnishingstatus
## 1              no             yes       2      yes        furnished
## 2              no             yes       3       no        furnished
## 3              no              no       2      yes   semi-furnished
## 4              no             yes       3      yes        furnished
## 5              no             yes       2       no        furnished
## 6              no             yes       2      yes   semi-furnished

We observe key variables such as:

price: selling price of the house
area: built-up area
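bedrooms, bathrooms, stories: counts of rooms and floors
parking: number of parking spaces
mainroad, guestroom, basement, hotwaterheating, airconditioning, prefarea: yes/no indicators
furnishingstatus: furnished, semi-furnished, or unfurnished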

3 Data Understanding

str(df)

## 'data.frame':    545 obs. of  13 variables:
##  $ price           : int  13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
##  $ area            : int  7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
##  $ bedrooms        : int  4 4 3 4 4 3 4 5 4 3 ...
##  $ bathrooms       : int  2 4 2 2 1 3 3 3 1 2 ...
##  $ stories         : int  3 4 2 2 2 1 4 2 2 4 ...
##  $ mainroad        : chr  "yes" "yes" "yes" "yes" ...
##  $ guestroom       : chr  "no" "no" "no" "no" ...
##  $ basement        : chr  "no" "no" "yes" "yes" ...
##  $ hotwaterheating : chr  "no" "no" "no" "no" ...
##  $ airconditioning : chr  "yes" "yes" "no" "yes" ...
##  $ parking         : int  2 3 2 3 2 2 2 0 2 1 ...
##  $ prefarea        : chr  "yes" "no" "yes" "yes" ...
##  $ furnishingstatus: chr  "furnished" "furnished" "semi-furnished" "furnished" ...

summary(df)

##      price               area          bedrooms       bathrooms    
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
##  Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
##     stories        mainroad          guestroom           basement        
##  Min.   :1.000   Length:545         Length:545         Length:545        
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.806                                                           
##  3rd Qu.:2.000                                                           
##  Max.   :4.000                                                           
##  hotwaterheating    airconditioning       parking         prefarea        
##  Length:545         Length:545         Min.   :0.0000   Length:545        
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.6936                     
##                                        3rd Qu.:1.0000                     
##                                        Max.   :3.0000                     
##  furnishingstatus  
##  Length:545        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

The dataset mixes numerical and categorical variables, a combination that tree-based models such as Random Forest handle well.

4 Data Cleaning

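We apply two basic cleaning rules to ensure data quality:

remove rows with missing values
filter out unrealistic area values (e.g., extremely small houses)

Neither rule removes any rows from this dataset: it still has 545 observations after cleaning, and the minimum area stays at 1650.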

# Remove missing values
df <- df %>% drop_na()

# Filter unrealistic area (example rule: > 200 sqft)
df <- df %>% filter(area > 200)

summary(df$area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1650    3600    4600    5151    6360   16200
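
To confirm what each cleaning rule actually removes, the row counts can be tracked step by step. The following is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration and are not taken from the housing dataset.

```r
# Hypothetical toy data standing in for the housing data frame
toy <- data.frame(
  price = c(100, 200, NA, 400, 500),
  area  = c(1500, 150, 2000, 2500, 3000)
)

n_start <- nrow(toy)

# Step 1: drop rows with any missing value (base R analogue of drop_na())
step1 <- toy[complete.cases(toy), ]

# Step 2: keep only rows that pass the area rule
step2 <- step1[step1$area > 200, ]

# Rows removed by each rule
cleaning_log <- data.frame(
  step    = c("missing values", "area <= 200"),
  removed = c(n_start - nrow(step1), nrow(step1) - nrow(step2))
)
print(cleaning_log)
```

Applied to the report's `df`, a log like this would make it explicit whether a rule is doing any work or is only a safeguard.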

5 Feature Engineering

We engineer additional variables and ensure that categorical fields are correctly encoded as factors.

# Create price per sqft
df <- df %>% mutate(price_per_sqft = price / area)

# Convert categorical variables to factors
df$mainroad         <- as.factor(df$mainroad)
df$guestroom        <- as.factor(df$guestroom)
df$basement         <- as.factor(df$basement)
df$hotwaterheating  <- as.factor(df$hotwaterheating)
df$airconditioning  <- as.factor(df$airconditioning)
df$prefarea         <- as.factor(df$prefarea)
df$furnishingstatus <- as.factor(df$furnishingstatus)

str(df)

## 'data.frame':    545 obs. of  14 variables:
##  $ price           : int  13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
##  $ area            : int  7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
##  $ bedrooms        : int  4 4 3 4 4 3 4 5 4 3 ...
##  $ bathrooms       : int  2 4 2 2 1 3 3 3 1 2 ...
##  $ stories         : int  3 4 2 2 2 1 4 2 2 4 ...
##  $ mainroad        : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ guestroom       : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 2 2 ...
##  $ basement        : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 1 ...
##  $ hotwaterheating : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ airconditioning : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ parking         : int  2 3 2 3 2 2 2 0 2 1 ...
##  $ prefarea        : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 2 2 ...
##  $ furnishingstatus: Factor w/ 3 levels "furnished","semi-furnished",..: 1 1 2 1 1 2 2 3 1 3 ...
##  $ price_per_sqft  : num  1792 1367 1230 1629 1538 ...
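
The seven `as.factor()` calls above could also be written as a single step that converts every character column. A minimal sketch in base R, using a made-up data frame `toy` (hypothetical, not the housing data):

```r
# Hypothetical toy data with two character columns and one numeric column
toy <- data.frame(
  area             = c(1650, 3600, 4600),
  mainroad         = c("yes", "no", "yes"),
  furnishingstatus = c("furnished", "unfurnished", "semi-furnished"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
char_cols <- names(toy)[sapply(toy, is.character)]
toy[char_cols] <- lapply(toy[char_cols], as.factor)

str(toy)
```

With dplyr loaded, `mutate(across(where(is.character), as.factor))` should give the same result. Either form avoids missing a column if new categorical fields are added later.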

6 Exploratory Data Analysis (EDA)

6.1 Price Distribution

ggplot(df, aes(price)) +
  geom_density(fill = "#9b59b6", alpha = 0.4) +
  labs(
    title = "Distribution of House Prices",
    x = "Price",
    y = "Density"
  ) +
  theme_minimal()

6.2 Bedrooms vs Price

ggplot(df, aes(bedrooms, price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "#2980b9") +
  labs(
    title = "Relationship Between Bedrooms and Price",
    x = "Number of Bedrooms",
    y = "Price"
  ) +
  theme_minimal()

6.3 Area vs Price

ggplot(df, aes(area, price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "#e74c3c") +
  labs(
    title = "Area vs Price",
    x = "Area",
    y = "Price"
  ) +
  theme_minimal()

6.4 Furnishing Status vs Price

ggplot(df, aes(furnishingstatus, price, fill = furnishingstatus)) +
  geom_boxplot(alpha = 0.8) +
  labs(
    title = "Price Distribution by Furnishing Status",
    x = "Furnishing Status",
    y = "Price"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

6.5 Parking vs Price

ggplot(df, aes(factor(parking), price)) +
  geom_boxplot(fill = "#3498db", alpha = 0.8) +
  labs(
    title = "Price vs Number of Parking Spaces",
    x = "Parking Spaces",
    y = "Price"
  ) +
  theme_minimal()

7 Correlation Analysis

To understand how numerical features are related to price, we compute a correlation matrix.

numeric_df <- df %>%
  select(price, area, bedrooms, bathrooms, stories, parking, price_per_sqft)

cor_mat <- cor(numeric_df)

corrplot(cor_mat, method = "color", addCoef.col = "black",
         tl.col = "black", number.cex = 0.7,
         main = "Correlation Heatmap of Numeric Features")

Insights:

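We expect price to be positively correlated with area, bedrooms, bathrooms, and stories; the coefficients in the heatmap show how strong each relationship actually is.
Because price_per_sqft is computed as price / area, its correlation with price and area is partly built in by construction and should not be read as an independent signal.
Strong correlations highlight which features are likely to be key drivers in the predictive model.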
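
To read the heatmap as a ranking, the price column of the correlation matrix can be extracted and sorted. A minimal sketch in base R follows; the data frame `toy` is hypothetical and stands in for `numeric_df`.

```r
# Hypothetical toy numeric data; the report's numeric_df would take its place
toy <- data.frame(
  price    = c(10, 12, 15, 20, 26),
  area     = c(1, 2, 3, 4, 5),
  bedrooms = c(2, 2, 3, 3, 4),
  parking  = c(1, 0, 2, 0, 1)
)

cor_toy <- cor(toy)

# Take the "price" column, drop price's correlation with itself,
# and sort the remaining features from most to least positively correlated
price_cor <- cor_toy[, "price"]
price_cor <- sort(price_cor[names(price_cor) != "price"], decreasing = TRUE)

print(round(price_cor, 3))
```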

8 Train–Test Split

set.seed(123)
trainIndex <- createDataPartition(df$price, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test  <- df[-trainIndex, ]

nrow(train); nrow(test)

## [1] 438

## [1] 107
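
For a numeric outcome, `createDataPartition()` samples within percentile groups of the outcome, so that train and test sets cover a similar price range. The sketch below illustrates that idea in base R. It is an approximation for explanation only, not caret's implementation; the simulated `price` vector and the choice of five groups are assumptions.

```r
set.seed(123)

# Hypothetical right-skewed prices (simulated, not the housing data)
price <- round(rlnorm(200, meanlog = 15, sdlog = 0.4))

# Bin the outcome into five quantile groups
breaks <- quantile(price, probs = (0:5) / 5)
bins   <- cut(price, breaks = breaks, include.lowest = TRUE)

# Sample about 80% of the row indices within each group
train_idx <- unlist(
  lapply(split(seq_along(price), bins), function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)
test_idx <- setdiff(seq_along(price), train_idx)

# Compare the price distribution in both parts
summary(price[train_idx])
summary(price[test_idx])
```

Sampling within groups keeps the few very expensive houses from landing entirely in one part of the split, which matters for a dataset of only 545 rows.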

9 Modeling

We compare a baseline Linear Regression model with a Random Forest model.

9.1 Linear Regression Model

model_lm <- lm(
  price ~ area + bedrooms + bathrooms + stories +
    mainroad + guestroom + basement + hotwaterheating +
    airconditioning + parking + prefarea + furnishingstatus,
  data = train
)

summary(model_lm)

## 
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories + 
##     mainroad + guestroom + basement + hotwaterheating + airconditioning + 
##     parking + prefarea + furnishingstatus, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2674777  -654183   -41657   482010  4681273 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -42053.57  287850.03  -0.146 0.883915    
## area                               252.21      26.46   9.533  < 2e-16 ***
## bedrooms                        134851.80   78837.38   1.711 0.087904 .  
## bathrooms                       976660.21  113788.88   8.583  < 2e-16 ***
## stories                         449036.43   69467.54   6.464 2.81e-10 ***
## mainroadyes                     464880.56  152408.07   3.050 0.002430 ** 
## guestroomyes                    236678.68  145487.05   1.627 0.104522    
## basementyes                     417954.12  123001.36   3.398 0.000743 ***
## hotwaterheatingyes              727162.92  230758.29   3.151 0.001741 ** 
## airconditioningyes              813120.17  119183.43   6.822 3.10e-11 ***
## parking                         276313.45   64808.00   4.264 2.48e-05 ***
## prefareayes                     577198.71  126844.76   4.550 7.00e-06 ***
## furnishingstatussemi-furnished  -91844.86  127520.35  -0.720 0.471776    
## furnishingstatusunfurnished    -430123.60  139861.29  -3.075 0.002239 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1052000 on 424 degrees of freedom
## Multiple R-squared:  0.6804, Adjusted R-squared:  0.6706 
## F-statistic: 69.43 on 13 and 424 DF,  p-value: < 2.2e-16
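
Each estimate is the predicted change in price for a one-unit increase in that predictor, with all other predictors held fixed. As a worked example, the sketch below uses values copied from the coefficient table above; the variable names are illustrative, and the confidence interval assumes the usual linear model assumptions hold.

```r
# Estimate and standard error for area, copied from summary(model_lm)
coef_area <- 252.21
se_area   <- 26.46
df_resid  <- 424

# Predicted price change for 1000 additional area units, others held fixed
delta_area <- coef_area * 1000
delta_area

# 95% confidence interval for the area coefficient
t_crit  <- qt(0.975, df = df_resid)
ci_area <- coef_area + c(-1, 1) * t_crit * se_area
round(ci_area, 2)
```

With a fitted model object, `confint(model_lm)` returns the same kind of interval for every coefficient at once.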

9.2 Random Forest Model

set.seed(123)
model_rf <- randomForest(
  price ~ area + bedrooms + bathrooms + stories +
    mainroad + guestroom + basement + hotwaterheating +
    airconditioning + parking + prefarea + furnishingstatus,
  data = train,
  ntree = 300,
  mtry = 4,
  importance = TRUE
)

# Variable importance plot
varImpPlot(model_rf, main = "Random Forest Feature Importance")

10 Model Evaluation

# Predictions on test set
pred_lm <- predict(model_lm, test)
pred_rf <- predict(model_rf, test)

# Compute evaluation metrics
rmse_lm <- rmse(test$price, pred_lm)
mae_lm  <- mae(test$price, pred_lm)

rmse_rf <- rmse(test$price, pred_rf)
mae_rf  <- mae(test$price, pred_rf)

rmse_lm; mae_lm

## [1] 1138053

## [1] 834204

rmse_rf; mae_rf

## [1] 1129612

## [1] 820556.4
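
RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging. The sketch below writes out both definitions in base R, checks them on a tiny made-up example, and then computes the relative gap between the two models from the values reported above. The function names `rmse_manual` and `mae_manual` are illustrative.

```r
# RMSE and MAE written out explicitly
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

# Tiny hypothetical example
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
rmse_manual(actual, predicted)
mae_manual(actual, predicted)

# Relative improvement (in percent) of Random Forest over Linear Regression,
# using the test-set metrics reported above
rmse_gain <- (1138053 - 1129612) / 1138053 * 100
mae_gain  <- (834204 - 820556.4) / 834204 * 100
round(c(rmse = rmse_gain, mae = mae_gain), 2)
```

On these numbers the Random Forest is ahead by roughly 0.7% in RMSE and 1.6% in MAE, a small margin for a single 107-row test set.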

10.1 KPI Summary

cat('<div class="kpi-box">',
    '<div class="kpi-label">Linear Regression RMSE</div>',
    '<div class="kpi-value">', round(rmse_lm, 2), '</div>',
    '</div>')

Linear Regression RMSE

1138053

cat('<div class="kpi-box">',
    '<div class="kpi-label">Random Forest RMSE</div>',
    '<div class="kpi-value">', round(rmse_rf, 2), '</div>',
    '</div>')

Random Forest RMSE

1129612

cat('<div class="kpi-box">',
    '<div class="kpi-label">Linear Regression MAE</div>',
    '<div class="kpi-value">', round(mae_lm, 2), '</div>',
    '</div>')

Linear Regression MAE

834204

cat('<div class="kpi-box">',
    '<div class="kpi-label">Random Forest MAE</div>',
    '<div class="kpi-value">', round(mae_rf, 2), '</div>',
    '</div>')

Random Forest MAE

820556.4

10.2 Actual vs Predicted (Random Forest)

comparison_df <- data.frame(
  actual    = test$price,
  predicted = pred_rf
)

ggplot(comparison_df, aes(actual, predicted)) +
  geom_point(color = "#27ae60", alpha = 0.6) +
  geom_abline(color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Housing Prices (Random Forest)",
    x = "Actual Price",
    y = "Predicted Price"
  ) +
  theme_minimal()
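
The scatter around the dashed line can also be summarized numerically. A minimal sketch, assuming a data frame with `actual` and `predicted` columns like `comparison_df`; the values in `toy` are made up for illustration.

```r
# Hypothetical stand-in for comparison_df
toy <- data.frame(
  actual    = c(2, 4, 6, 8),
  predicted = c(3, 4, 5, 9)
)

residuals_toy <- toy$actual - toy$predicted

# Test-set R-squared: share of variance in actual prices explained by predictions
sse <- sum(residuals_toy^2)
sst <- sum((toy$actual - mean(toy$actual))^2)
r2  <- 1 - sse / sst

# Mean signed error: positive means the model over-predicts on average
bias <- mean(toy$predicted - toy$actual)

c(r2 = r2, bias = bias)
```

Points far from the dashed line at the high end of the price range would show up here as a lower R-squared and a negative bias among expensive houses.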

11 Findings & Business Implications

Based on the analysis and modeling results:

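Size matters: Area, bathrooms, and stories are strongly associated with higher prices in the linear model (each p < 0.001). Bedrooms has a positive but weaker effect (p ≈ 0.088) once the other features are controlled for.
Quality and amenities matter: Air conditioning, basement, hot water heating, main road access, and preferred area (prefarea) all have positive, statistically significant coefficients.
Parking capacity: Each additional parking space is associated with a price roughly 276,000 higher, other features held fixed, which may reflect convenience and added value in dense urban settings.
Furnishing status: Unfurnished houses are priced about 430,000 lower than furnished ones (p ≈ 0.002); the difference between semi-furnished and furnished houses is not statistically significant.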

From a real estate industry perspective, these insights can be used to:

Support fair and consistent pricing when listing properties.
Identify high-value renovation opportunities (e.g., adding air conditioning or improving furnishing).
Help buyers understand which attributes drive price differences in the market.
Assist banks and valuers in building automated valuation models (AVMs) for mortgage risk assessment.

12 Conclusion & Future Work

This study shows that:

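Data Science techniques can support housing price prediction: the linear model alone explains about 68% of the price variance in the training data (R-squared 0.6804).
The Random Forest model outperforms Linear Regression on the test set, but only marginally: RMSE is 1,129,612 versus 1,138,053 (about 0.7% lower) and MAE is 820,556 versus 834,204 (about 1.6% lower). Any non-linear relationships and interactions it captures therefore add little beyond the linear fit on this dataset.
Key drivers of housing price include area, bathrooms, stories, and several amenity-related features; bedrooms contributes less once these are accounted for.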

Future extensions could include:

Incorporating location-based variables (e.g., distance to city center, schools, public transport).
Using more advanced models (e.g., Gradient Boosting, XGBoost) and performing hyperparameter tuning.
Deploying the model as a simple web-based price prediction tool for agents and buyers.

13 Appendix: Export Predictions

write.csv(comparison_df, "model_predictions_rf.csv", row.names = FALSE)

Data Science for Real Estate: Housing Price Prediction

OUYANG ZIXUAN (24215697)

03 January 2026