Group Members:

Name Student ID
BIAN CHEN FANG 24233604
SHAO ZITONG 24235094
DENG LINGYAN 24235830
Jasmine Fu Ming Yee 24217068
WANG JUNJIE 23073465

Supervisor: Assoc. Prof. Dr. Ang Tan Fong

Group division:

Stage Objective Owner
Project Kick-off Confirm the dataset All
Confirm individual roles and responsibilities All
Project understanding + task allocation BIAN CHEN FANG
Data Engineering & EDA 1. Project Overview (Introduction) SHAO ZITONG
WANG JUNJIE
2. Data Understanding
3. Data Cleaning & Pre-processing
4. Exploratory Data Analysis (EDA)
Modelling & Evaluation 5. Modeling & Evaluation Jasmine Fu Ming Yee
DENG LINGYAN
6. Conclusion & Discussion
Integration & Delivery Final report formatting BIAN CHEN FANG
Presentation preparation & video recording/editing BIAN CHEN FANG
Final review & assignment submission ALL

1. Introduction

1.1 Project Background

The real estate market is a vital component of the global economy. It not only meets basic housing needs but also serves as a key asset for personal and institutional investment. This makes accurate property valuation essential for making trading decisions and managing credit risk.

However, house prices are influenced by a wide range of factors. These span from macro elements like location and neighborhood to micro details such as building structure and furnishing. This mixture makes pricing trends complex and non-linear. Traditional valuation methods often rely on an appraiser’s personal experience or simple comparisons. This approach is often inefficient and prone to subjective bias. It also struggles to quantify exactly how specific features, such as garage size or basement condition, contribute to the final price.

Using data science to understand pricing mechanisms is now a key industry trend. This project aims to use statistical analysis and predictive models to filter noise from the data. Our goal is to build an objective pricing model and identify the key factors that drive value differences. Through this, we hope to explore the potential of data science in real estate and provide an objective, data-driven perspective on price fluctuations.

1.2 Dataset Overview

This project uses the “House Price” dataset from Kaggle, which supports in-depth analysis and price prediction for the real estate market.

The dataset records detailed physical attributes and transaction backgrounds. Unlike simpler datasets, it goes beyond basic room counts like Bathrooms and Balconies. It includes specific details such as Carpet Area, Super Area, Furnishing status, Floor level, and Transaction Type.

With its rich features and clear structure, this dataset is ideal for building multiple regression models to predict the Total Amount. It also offers great potential for classification tasks, such as analyzing how different furnishing levels or locations determine property value.

1.3 Problem Statement

While real estate data seems transparent, pricing mechanisms are often subjective and complex.

In reality, property value is not decided by a single metric. It depends on a mix of factors, including building specifications such as the difference between super and carpet area, furnishing status, and transaction details. With such multidimensional data, relying on intuition or manual estimates is insufficient. It is difficult to capture the complex relationships between these variables, which often leads to valuation errors.

Without a systematic statistical framework, we cannot accurately measure how features interact. This prevents us from building an objective prediction model and leaves the core factors driving price fluctuations unidentified.

1.4 Project Objectives

Based on the problem statement above, this project establishes the following core objectives, covering the entire process from building a data foundation to supporting business decisions.

NO. Objective Description
1 Build Data Foundation Clean and preprocess raw data to address missing values and outliers. Use Exploratory Data Analysis (EDA) to visualize variable distributions and reveal initial links between physical attributes (e.g., area, furnishing) and price.
2 Establish Price Prediction Framework Build and compare baseline and advanced regression models. Optimize algorithms to select the best model, quantify feature impact, and achieve accurate predictions of the total amount.
3 Analyze Price Drivers Convert continuous prices into discrete levels. Build classification models to identify key determinants that distinguish price ranges, analyzing price formation from a new dimension.
4 Model Evaluation & Decision Support Systematically evaluate model performance and compare results. Summarize core patterns based on data analysis to provide practical business suggestions and strategic outlooks for market participants.

Before beginning our analysis, we need to load the necessary R packages for data manipulation, visualization, and modeling.

# Data manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
library(stringr)

# Visualization
library(ggplot2)
library(corrplot)
library(scales)

# Data summary
library(skimr)
library(summarytools)

2. Data Understanding

🎯 Core Objective: Explore the dataset to understand its structure, content, and quality, and identify potential issues for data cleaning.

Function Purpose
read.csv() Load dataset from CSV file
nrow(), ncol() Get dataset dimensions
str() Display variable structure and types
summary() Statistical summary of variables
is.na(), sapply() Count and analyze missing values
ggplot() + geom_bar() Visualize categorical distributions
table() Frequency counts for categories

Data understanding is a critical first step in any data science project. In this section, we will explore the dataset to understand its structure, content, and quality. This will help us identify potential issues that need to be addressed during data cleaning.

2.1 Load the Dataset

We begin by loading the dataset into R. The dataset is stored in CSV format, which is a common format for tabular data.

# Load the dataset
data <- read.csv("house_prices.csv", stringsAsFactors = FALSE)

2.2 Data Source and Description

The dataset used in this analysis is the “House Price” dataset obtained from Kaggle, compiled by Juhi Bhojani. It comprises real estate listings from various cities across India, containing information about property characteristics, pricing, and transaction details.

Source: https://www.kaggle.com/datasets/juhibhojani/house-price

2.3 Dataset Dimensions

Understanding the size of our dataset helps us assess the scope of our analysis. A larger dataset generally provides more reliable statistical insights.

# Get dimensions
cat("Number of Rows:", nrow(data), "\n")
## Number of Rows: 187531
cat("Number of Columns:", ncol(data), "\n")
## Number of Columns: 21

Our dataset contains 187,531 property listings with 21 variables. This is a substantial dataset that provides a comprehensive view of the Indian real estate market.

2.4 Variable Structure and Types

Understanding the data types of each variable is essential for proper data handling. Variables may be stored as integers, numeric values, or character strings.

# Display column names and types
str(data)
## 'data.frame':    187531 obs. of  21 variables:
##  $ Index            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Title            : chr  "1 BHK Ready to Occupy Flat for sale in Srushti Siddhi Mangal Murti Complex Bhiwandi" "2 BHK Ready to Occupy Flat for sale in Dosti Vihar Pokhran Road" "2 BHK Ready to Occupy Flat for sale in Sunrise by Kalpataru Kolshet Road" "1 BHK Ready to Occupy Flat for sale Kasheli" ...
##  $ Description      : chr  "Bhiwandi, Thane has an attractive 1 BHK Flat for sale. The property is ideally located in a strategic location "| __truncated__ "One can find this stunning 2 BHK flat for sale in Pokhran Road, Thane. It enjoys an excellent location within t"| __truncated__ "Up for immediate sale is a 2 BHK apartment in Kolshet Road, Thane. Don't miss this bargain flat for sale. Situa"| __truncated__ "This beautiful 1 BHK Flat is available for sale in Kasheli, Thane. This flat for resale has a desirable locatio"| __truncated__ ...
##  $ Amount.in.rupees.: chr  "42 Lac " "98 Lac " "1.40 Cr " "25 Lac " ...
##  $ Price..in.rupees.: int  6000 13799 17500 NA 18824 6618 2538 10435 10000 11150 ...
##  $ location         : chr  "thane" "thane" "thane" "thane" ...
##  $ Carpet.Area      : chr  "500 sqft" "473 sqft" "779 sqft" "530 sqft" ...
##  $ Status           : chr  "Ready to Move" "Ready to Move" "Ready to Move" "Ready to Move" ...
##  $ Floor            : chr  "10 out of 11" "3 out of 22" "10 out of 29" "1 out of 3" ...
##  $ Transaction      : chr  "Resale" "Resale" "Resale" "Resale" ...
##  $ Furnishing       : chr  "Unfurnished" "Semi-Furnished" "Unfurnished" "Unfurnished" ...
##  $ facing           : chr  "" "East" "East" "" ...
##  $ overlooking      : chr  "" "Garden/Park" "Garden/Park" "" ...
##  $ Society          : chr  "Srushti Siddhi Mangal Murti Complex" "Dosti Vihar" "Sunrise by Kalpataru" "" ...
##  $ Bathroom         : chr  "1" "2" "2" "1" ...
##  $ Balcony          : chr  "2" "" "" "1" ...
##  $ Car.Parking      : chr  "" "1 Open" "1 Covered" "" ...
##  $ Ownership        : chr  "" "Freehold" "Freehold" "" ...
##  $ Super.Area       : chr  "" "" "" "" ...
##  $ Dimensions       : logi  NA NA NA NA NA NA ...
##  $ Plot.Area        : logi  NA NA NA NA NA NA ...

The output reveals several important observations about our data structure:

  • Numeric variables: Index, Price..in.rupees., Dimensions, Plot.Area

  • Character variables: Title, Description, Amount.in.rupees., location, Carpet.Area, Status, Floor, Transaction, Furnishing, facing, overlooking, Society, Bathroom, Balcony, Car.Parking, Ownership, Super.Area

Notably, several variables that should be numeric (such as Amount.in.rupees., Carpet.Area, Bathroom) are stored as character types due to formatting issues in the original data. This will need to be addressed during data cleaning.

2.5 Variable Descriptions

The following table provides a description of each variable in the dataset:

Variable Description Expected Type
Index Row identifier Integer
Title Property listing title Text
Description Detailed property description Text
Amount.in.rupees. Total price (in Lac/Cr format) Numeric
Price..in.rupees. Price per square foot Numeric
location City/area name Categorical
Carpet.Area Usable floor area (sqft) Numeric
Status Property status Categorical
Floor Floor information Numeric
Transaction Transaction type Categorical
Furnishing Furnishing status Categorical
facing Direction property faces Categorical
overlooking View from property Categorical
Society Housing society name Text
Bathroom Number of bathrooms Numeric
Balcony Number of balconies Numeric
Car.Parking Parking information Numeric
Ownership Ownership type Categorical
Super.Area Built-up area (sqft) Numeric
Dimensions Property dimensions Numeric
Plot.Area Plot area Numeric

2.6 Statistical Summary

A statistical summary provides an overview of the distribution of numeric variables, including measures of central tendency and spread.

summary(data)
##      Index           Title           Description        Amount.in.rupees. 
##  Min.   :     0   Length:187531      Length:187531      Length:187531     
##  1st Qu.: 46883   Class :character   Class :character   Class :character  
##  Median : 93765   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 93765                                                           
##  3rd Qu.:140648                                                           
##  Max.   :187530                                                           
##                                                                           
##  Price..in.rupees.   location         Carpet.Area           Status         
##  Min.   :      0   Length:187531      Length:187531      Length:187531     
##  1st Qu.:   4297   Class :character   Class :character   Class :character  
##  Median :   6034   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   7584                                                           
##  3rd Qu.:   9450                                                           
##  Max.   :6700000                                                           
##  NA's   :17665                                                             
##     Floor           Transaction         Furnishing           facing         
##  Length:187531      Length:187531      Length:187531      Length:187531     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  overlooking          Society            Bathroom           Balcony         
##  Length:187531      Length:187531      Length:187531      Length:187531     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Car.Parking         Ownership          Super.Area        Dimensions    
##  Length:187531      Length:187531      Length:187531      Mode:logical  
##  Class :character   Class :character   Class :character   NA's:187531   
##  Mode  :character   Mode  :character   Mode  :character                 
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  Plot.Area     
##  Mode:logical  
##  NA's:187531   
##                
##                
##                
##                
## 

2.7 Missing Values Analysis

The identification and quantification of missing values is crucial for determining appropriate data cleaning strategies. The following analysis presents the extent of missing data across all variables.

# Count missing values for each column
missing_values <- data.frame(
  Column = names(data),
  Missing_Count = sapply(data, function(x) sum(is.na(x) | x == "" | x == "NA")),
  Missing_Percentage = sapply(data, function(x) round(sum(is.na(x) | x == "" | x == "NA") / length(x) * 100, 2))
)

# Sort by missing percentage
missing_values <- missing_values[order(-missing_values$Missing_Percentage), ]
rownames(missing_values) <- NULL

# Display
print(missing_values)
##               Column Missing_Count Missing_Percentage
## 1         Dimensions        187531             100.00
## 2          Plot.Area        187531             100.00
## 3            Society        109678              58.49
## 4         Super.Area        107685              57.42
## 5        Car.Parking        103357              55.11
## 6        overlooking         81436              43.43
## 7        Carpet.Area         80673              43.02
## 8             facing         70233              37.45
## 9          Ownership         65517              34.94
## 10           Balcony         48935              26.09
## 11 Price..in.rupees.         17665               9.42
## 12             Floor          7077               3.77
## 13       Description          3023               1.61
## 14        Furnishing          2897               1.54
## 15          Bathroom           828               0.44
## 16            Status           615               0.33
## 17       Transaction            83               0.04
## 18             Index             0               0.00
## 19             Title             0               0.00
## 20 Amount.in.rupees.             0               0.00
## 21          location             0               0.00

The missing value analysis reveals critical data quality issues: - Dimensions and Plot.Area have 100% missing values, making them unusable for analysis - Society (58.49%), Super.Area (57.42%), and Car.Parking (55.11%) have more than half of their values missing - Several other variables including overlooking, Carpet.Area, facing, Ownership, and Balcony have significant missing rates (26-43%)

2.8 Categorical Variables Distribution

Understanding the distribution of categorical variables provides insights into the composition of the dataset and potential class imbalances.

2.8.1 Geographic Distribution

The geographic distribution of properties is presented in Figure 1. This analysis reveals the representation of different cities within the dataset.

ggplot(data, aes(x = reorder(location, location, length), fill = location)) +
  geom_bar(width = 0.7) +
  geom_text(stat = "count", aes(label = ..count..), hjust = -0.2, size = 1) +
  coord_flip() +
  labs(title = "Distribution of Properties by Location",
       x = "Location", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 4))
Figure 1: Distribution of Properties by Location

Figure 1: Distribution of Properties by Location

The results indicate that the dataset encompasses over 70 cities across India. The highest concentrations of listings are observed in New Delhi (27,599), Bangalore (24,030), Kolkata (22,380), and Gurgaon (20,070), reflecting the major metropolitan real estate markets in India.

2.8.2 Transaction Type Distribution

The distribution of transaction types is illustrated in Figure 2, distinguishing between resale properties and new developments.

transaction_dist <- table(data$Transaction)
print(transaction_dist)
## 
##              New Property        Other   Rent/Lease       Resale 
##           83        42565          709            2       144172
data %>%
  filter(!is.na(Transaction) & Transaction != "") %>%
  ggplot(aes(x = Transaction, fill = Transaction)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  labs(title = "Distribution by Transaction Type",
       x = "Transaction Type", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

The analysis reveals that resale properties constitute the majority of listings (144,172; 76.9%), followed by new properties (42,565; 22.7%). This distribution suggests a mature secondary housing market in India.

2.8.3 Furnishing Status Distribution

The furnishing status of properties is presented in Figure 3, categorised as Furnished, Semi-Furnished, or Unfurnished.

furnishing_dist <- table(data$Furnishing)
print(furnishing_dist)
## 
##                     Furnished Semi-Furnished    Unfurnished 
##           2897          20162          88318          76154
data %>%
  filter(!is.na(Furnishing) & Furnishing != "") %>%
  ggplot(aes(x = Furnishing, fill = Furnishing)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  labs(title = "Distribution by Furnishing Status",
       x = "Furnishing Status", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

The distribution indicates that semi-furnished properties represent the largest category (88,318; 47.8%), followed by unfurnished properties (76,154; 41.3%). Fully furnished properties constitute a smaller proportion (20,162; 10.9%).

2.8.4 Ownership Type Distribution

The ownership type distribution is illustrated in Figure 4, showing the prevalence of different ownership structures.

data %>%
  filter(!is.na(Ownership) & Ownership != "") %>%
  ggplot(aes(x = Ownership, fill = Ownership)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  labs(title = "Distribution by Ownership Type",
       x = "Ownership Type", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

The results indicate that Freehold properties, which confer complete ownership rights, represent the predominant ownership type in the dataset.

2.9 Key Findings from Data Understanding

The data understanding phase has revealed several key findings that will inform subsequent data cleaning and analysis procedures:

  1. Dataset Scale: The dataset comprises 187,531 property listings with 21 variables, providing a substantial sample for analysis of the Indian real estate market.

  2. Geographic Coverage: Property listings span over 70 cities across India, with major metropolitan areas including New Delhi, Bangalore, Kolkata, and Gurgaon accounting for the highest representation.

  3. Market Composition: The market is predominantly composed of resale properties (76.9%), with new properties representing 22.7% of listings. This indicates a mature secondary housing market.

  4. Property Characteristics: The majority of properties are semi-furnished (47.8%) or unfurnished (41.3%), with fully furnished properties representing only 10.9% of the market.

  5. Data Quality Issues: Significant missing values were identified, with Dimensions and Plot.Area exhibiting 100% missing rates. Additionally, Society (58.49%), Super.Area (57.42%), and Car.Parking (55.11%) demonstrate substantial missing values requiring appropriate handling strategies.

  6. Data Type Inconsistencies: Several variables containing numeric values are stored as character types due to formatting issues (e.g., “42 Lac”, “500 sqft”, “3 out of 10”). These require conversion to numeric format during data cleaning.

  7. Format Standardisation Required: Price values contain “Lac” and “Cr” suffixes, area values include “sqft” units, and floor information follows an “X out of Y” format. All these require standardisation for quantitative analysis.

3. Data Cleaning and Pre-processing

🎯 Core Objective: Clean and transform raw data to address missing values, format inconsistencies, and prepare a structured dataset for analysis.

Function Purpose
select() Remove unnecessary columns
gsub(), as.numeric() Extract and convert numeric values from text
grepl() Pattern matching for unit conversion (Lac/Cr)
regmatches(), gregexpr() Extract floor numbers from text
trimws(), tolower() Standardize categorical variables
filter(), mutate() Handle missing values and create new variables
ifelse() Conditional value assignment
median() Impute missing numeric values
write.csv() Export cleaned dataset

Based on the findings from the data understanding phase, this section documents the systematic cleaning and transformation procedures applied to prepare the dataset for analysis.

3.1 Create a Copy of Original Data

Preserve original data by creating a working copy before any modifications.

# Create a copy for cleaning
data_clean <- data
cat("Original dataset dimensions:", dim(data_clean), "\n")
## Original dataset dimensions: 187531 21

3.2 Removal of Unnecessary Variables

Variables that do not contribute to the analysis are removed based on the following criteria:

  1. identifier columns with no analytical value

  2. free-text fields unsuitable for quantitative analysis

  3. variables with excessive unique values causing sparsity

  4. variables with complete missing data

# Remove columns that are not useful for analysis
# - Index: Just a row number
# - Title, Description: Text data not useful for modeling
# - Society: Too many unique values (1856), sparse information
# - Plot.Area, Dimensions: 100% missing

cols_to_remove <- c("Index", "Title", "Description", "Society", "Plot.Area", "Dimensions")

data_clean <- data_clean %>%
  select(-all_of(cols_to_remove))

cat("Columns removed:", paste(cols_to_remove, collapse = ", "), "\n")
## Columns removed: Index, Title, Description, Society, Plot.Area, Dimensions
cat("Remaining columns:", ncol(data_clean), "\n")
## Remaining columns: 15
cat("Column names:", paste(names(data_clean), collapse = ", "), "\n")
## Column names: Amount.in.rupees., Price..in.rupees., location, Carpet.Area, Status, Floor, Transaction, Furnishing, facing, overlooking, Bathroom, Balcony, Car.Parking, Ownership, Super.Area

3.3 Price Variable Conversion

The Amount.in.rupees. variable contains prices in Indian notation with “Lac” (Lakh = 100,000 INR) and “Cr” (Crore = 10,000,000 INR) suffixes. A standardised numeric format in Lakhs is created for quantitative analysis.

# Function to convert price string to numeric (in Lakhs)
convert_price <- function(price_str) {
  if (is.na(price_str) || price_str == "") {
    return(NA)
  }
  
  # Remove extra spaces and convert to lowercase
  price_str <- trimws(tolower(price_str))
  
  # Extract numeric value
  num <- as.numeric(gsub("[^0-9.]", "", price_str))
  
  if (is.na(num)) {
    return(NA)
  }
  
  # Convert based on unit
  if (grepl("cr", price_str)) {
    return(num * 100)  # Convert Crore to Lac
  } else if (grepl("lac", price_str)) {
    return(num)  # Already in Lac
  } else {
    return(NA)
  }
}

# Apply conversion
data_clean$Price_Lac <- sapply(data_clean$Amount.in.rupees., convert_price)

# Check the conversion
cat("Sample of price conversion:\n")
## Sample of price conversion:
head(data.frame(
  Original = data_clean$Amount.in.rupees.,
  Converted_Lac = data_clean$Price_Lac
), 15)
##     Original Converted_Lac
## 1    42 Lac           42.0
## 2    98 Lac           98.0
## 3   1.40 Cr          140.0
## 4    25 Lac           25.0
## 5   1.60 Cr          160.0
## 6    45 Lac           45.0
## 7  16.5 Lac           16.5
## 8    60 Lac           60.0
## 9    60 Lac           60.0
## 10  1.60 Cr          160.0
## 11  1.40 Cr          140.0
## 12  1.36 Cr          136.0
## 13  1.35 Cr          135.0
## 14  4.25 Cr          425.0
## 15   75 Lac           75.0
# Summary of converted prices
summary(data_clean$Price_Lac)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      1.0     48.4     78.0    119.8    145.0 140030.0     9684

3.4 Area Variables Conversion

The Carpet.Area and Super.Area variables contain values with “sqft” suffixes. Numeric values are extracted for quantitative analysis. Carpet Area represents the actual usable floor area, while Super Area includes common spaces and is typically 20-30% larger.

# Function to extract numeric area value

extract_area <- function(area_str) {
  if (is.na(area_str) || area_str == "") {
    return(NA)
  }
  
  # Extract numeric value
  num <- as.numeric(gsub("[^0-9.]", "", area_str))
  return(num)
}

# Apply to Carpet.Area
data_clean$Carpet_Area_sqft <- sapply(data_clean$Carpet.Area, extract_area)

# Apply to Super.Area
data_clean$Super_Area_sqft <- sapply(data_clean$Super.Area, extract_area)

# Check the conversion
cat("Sample of Carpet Area conversion:\n")
## Sample of Carpet Area conversion:
head(data.frame(
  Original = data_clean$Carpet.Area,
  Converted = data_clean$Carpet_Area_sqft
), 10)
##    Original Converted
## 1  500 sqft       500
## 2  473 sqft       473
## 3  779 sqft       779
## 4  530 sqft       530
## 5  635 sqft       635
## 6                  NA
## 7  550 sqft       550
## 8                  NA
## 9                  NA
## 10 900 sqft       900
cat("\nSummary of Carpet Area (sqft):\n")
## 
## Summary of Carpet Area (sqft):
summary(data_clean$Carpet_Area_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     745    1000    1200    1500  709222   80673
cat("\nSummary of Super Area (sqft):\n")
## 
## Summary of Super Area (sqft):
summary(data_clean$Super_Area_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     925    1250    1323    1650  530040  107685

3.5 Floor Information Extraction

The Floor variable contains values in “X out of Y” format. Two separate numeric variables are extracted: Current_Floor (the floor where the property is located) and Total_Floors (the total number of floors in the building).

# Function to extract floor information
extract_floor <- function(floor_str) {
  if (is.na(floor_str) || floor_str == "") {
    return(c(NA, NA))
  }
  
  # Handle "Ground" as floor 0
  floor_str <- gsub("Ground", "0", floor_str, ignore.case = TRUE)
  
  # Extract numbers
  numbers <- as.numeric(unlist(regmatches(floor_str, gregexpr("[0-9]+", floor_str))))
  
  if (length(numbers) >= 2) {
    return(c(numbers[1], numbers[2]))  # current floor, total floors
  } else {
    return(c(NA, NA))
  }
}

# Apply extraction
floor_info <- t(sapply(data_clean$Floor, extract_floor))
data_clean$Current_Floor <- as.numeric(floor_info[, 1])
data_clean$Total_Floors <- as.numeric(floor_info[, 2])

# Check the conversion
cat("Sample of Floor conversion:\n")
## Sample of Floor conversion:
head(data.frame(
  Original = data_clean$Floor,
  Current_Floor = data_clean$Current_Floor,
  Total_Floors = data_clean$Total_Floors
), 15)
##           Original Current_Floor Total_Floors
## 1     10 out of 11            10           11
## 2      3 out of 22             3           22
## 3     10 out of 29            10           29
## 4       1 out of 3             1            3
## 5     20 out of 42            20           42
## 6       2 out of 7             2            7
## 7       4 out of 5             4            5
## 8  Ground out of 7             0            7
## 9  Ground out of 2             0            2
## 10     3 out of 27             3           27
## 11     6 out of 20             6           20
## 12    16 out of 24            16           24
## 13     8 out of 20             8           20
## 14    18 out of 27            18           27
## 15      2 out of 3             2            3

3.6 Bathroom and Balcony Conversion

The Bathroom and Balcony variables contain primarily numeric values with some entries in “> 10” format. These are converted to proper numeric values.

# Function to clean numeric columns with "> X" format
clean_numeric_col <- function(x) {
  if (is.na(x) || x == "") {
    return(NA)
  }
  
  x <- as.character(x)
  
  # Handle "> 10" type values
  if (grepl(">", x)) {
    num <- as.numeric(gsub("[^0-9]", "", x))
    return(num + 1)  # "> 10" becomes 11
  }
  
  return(as.numeric(x))
}

# Apply to Bathroom
data_clean$Bathroom_num <- sapply(data_clean$Bathroom, clean_numeric_col)

# Apply to Balcony
data_clean$Balcony_num <- sapply(data_clean$Balcony, clean_numeric_col)

# Check
cat("Bathroom distribution:\n")
## Bathroom distribution:
table(data_clean$Bathroom_num, useNA = "ifany")
## 
##     1     2     3     4     5     6     7     8     9    10    11  <NA> 
## 18654 93007 55781 15600  3343   209    35    14    11    14    35   828
cat("\nBalcony distribution:\n")
## 
## Balcony distribution:
table(data_clean$Balcony_num, useNA = "ifany")
## 
##     1     2     3     4     5     6     7     8     9    10    11  <NA> 
## 49219 51809 27111  9420   841   132    14    13     2    13    22 48935

3.7 Parking Information Extraction

The Car.Parking variable contains values such as “1 Covered” and “2 Open”. The numeric count of parking spaces is extracted.

# Function to extract parking count
extract_parking <- function(park_str) {
  if (is.na(park_str) || park_str == "") {
    return(NA)
  }
  
  # Extract first number
  num <- as.numeric(gsub("([0-9]+).*", "\\1", park_str))
  return(num)
}

data_clean$Parking_num <- sapply(data_clean$Car.Parking, extract_parking)

# Check
cat("Parking distribution (top 10):\n")
## Parking distribution (top 10):
head(sort(table(data_clean$Parking_num), decreasing = TRUE), 10)
## 
##     1     2    10     3    34   402     8     4     5     6 
## 63618 17258   871   585   573   318   215   172    71    40

3.8 Categorical Variable Standardisation

Categorical variables are standardised through case normalisation, whitespace trimming, and conversion of empty strings to NA values.

# Standardize location (lowercase)
data_clean$location <- tolower(trimws(data_clean$location))

# Standardize Furnishing
data_clean$Furnishing <- trimws(data_clean$Furnishing)
data_clean$Furnishing[data_clean$Furnishing == ""] <- NA

# Standardize Transaction
data_clean$Transaction <- trimws(data_clean$Transaction)
data_clean$Transaction[data_clean$Transaction == ""] <- NA

# Standardize Ownership
data_clean$Ownership <- trimws(data_clean$Ownership)
data_clean$Ownership[data_clean$Ownership == ""] <- NA

# Standardize facing
data_clean$facing <- trimws(data_clean$facing)
data_clean$facing[data_clean$facing == ""] <- NA

# Check unique values
cat("Unique Locations:", unique(data_clean$location), "\n")
## Unique Locations: thane navi-mumbai nagpur mumbai ahmedabad bangalore chennai gurgaon hyderabad indore jaipur kolkata lucknow new-delhi noida pune agra ahmadnagar allahabad aurangabad badlapur belgaum bhiwadi bhiwandi bhopal bhubaneswar chandigarh coimbatore dehradun durgapur ernakulam faridabad ghaziabad goa greater-noida guntur guwahati gwalior haridwar jabalpur jamshedpur jodhpur kalyan kanpur kochi kozhikode ludhiana madurai mangalore mohali mysore nashik navsari nellore palakkad palghar panchkula patna pondicherry raipur rajahmundry ranchi satara shimla siliguri solapur sonipat surat thrissur tirupati trichy trivandrum udaipur udupi vadodara vapi varanasi vijayawada visakhapatnam vrindavan zirakpur
cat("Unique Furnishing:", unique(data_clean$Furnishing), "\n")
## Unique Furnishing: Unfurnished Semi-Furnished Furnished NA
cat("Unique Transaction:", unique(data_clean$Transaction), "\n")
## Unique Transaction: Resale New Property NA Other Rent/Lease
cat("Unique Ownership:", unique(data_clean$Ownership), "\n")
## Unique Ownership: NA Freehold Co-operative Society Power Of Attorney Leasehold

3.9 Removal of Original Columns

Following the creation of cleaned variables, original columns are removed to maintain dataset clarity and prevent redundancy.

# Remove original columns that have been cleaned
cols_to_remove_2 <- c("Amount.in.rupees.", "Carpet.Area", "Super.Area", 
                       "Floor", "Bathroom", "Balcony", "Car.Parking",
                       "Status", "overlooking")

data_clean <- data_clean %>%
  select(-all_of(cols_to_remove_2))

# Rename Price..in.rupees. to Price_per_sqft
names(data_clean)[names(data_clean) == "Price..in.rupees."] <- "Price_per_sqft"

cat("Cleaned dataset columns:\n")
## Cleaned dataset columns:
names(data_clean)
##  [1] "Price_per_sqft"   "location"         "Transaction"      "Furnishing"      
##  [5] "facing"           "Ownership"        "Price_Lac"        "Carpet_Area_sqft"
##  [9] "Super_Area_sqft"  "Current_Floor"    "Total_Floors"     "Bathroom_num"    
## [13] "Balcony_num"      "Parking_num"

3.10 Missing Value Treatment

Missing values are addressed using appropriate strategies based on variable type and importance. The target variable (Price_Lac) with missing values requires row deletion, numeric variables are imputed using median values (robust to outliers), and categorical variables are assigned an “Unknown” category.

# Check missing values in cleaned dataset
missing_clean <- data.frame(
  Column = names(data_clean),
  Missing_Count = sapply(data_clean, function(x) sum(is.na(x))),
  Missing_Pct = sapply(data_clean, function(x) round(sum(is.na(x)) / length(x) * 100, 2))
)
missing_clean <- missing_clean[order(-missing_clean$Missing_Pct), ]
print(missing_clean)
##                            Column Missing_Count Missing_Pct
## Super_Area_sqft   Super_Area_sqft        107685       57.42
## Parking_num           Parking_num        103357       55.11
## Carpet_Area_sqft Carpet_Area_sqft         80673       43.02
## facing                     facing         70233       37.45
## Ownership               Ownership         65517       34.94
## Balcony_num           Balcony_num         48935       26.09
## Price_per_sqft     Price_per_sqft         17665        9.42
## Price_Lac               Price_Lac          9684        5.16
## Current_Floor       Current_Floor          7487        3.99
## Total_Floors         Total_Floors          7487        3.99
## Furnishing             Furnishing          2897        1.54
## Bathroom_num         Bathroom_num           828        0.44
## Transaction           Transaction            83        0.04
## location                 location             0        0.00
# For numeric columns with missing values, we have options:
# 1. Remove rows with NA in critical columns (Price_Lac, Carpet_Area_sqft)
# 2. Impute with median for less critical columns

# Remove rows where Price_Lac is missing (required for regression)
data_clean <- data_clean %>%
  filter(!is.na(Price_Lac))

cat("\nRows after removing NA in Price_Lac:", nrow(data_clean), "\n")
## 
## Rows after removing NA in Price_Lac: 177847
# For Carpet_Area_sqft, if missing, try to use Super_Area_sqft
data_clean <- data_clean %>%
  mutate(Area_sqft = ifelse(is.na(Carpet_Area_sqft), Super_Area_sqft, Carpet_Area_sqft))

# Impute missing numeric values with median
data_clean$Bathroom_num[is.na(data_clean$Bathroom_num)] <- median(data_clean$Bathroom_num, na.rm = TRUE)
data_clean$Balcony_num[is.na(data_clean$Balcony_num)] <- median(data_clean$Balcony_num, na.rm = TRUE)
data_clean$Parking_num[is.na(data_clean$Parking_num)] <- median(data_clean$Parking_num, na.rm = TRUE)

# For categorical variables, impute with "Unknown" or mode
data_clean$Furnishing[is.na(data_clean$Furnishing)] <- "Unknown"
data_clean$Transaction[is.na(data_clean$Transaction)] <- "Unknown"
data_clean$Ownership[is.na(data_clean$Ownership)] <- "Unknown"
data_clean$facing[is.na(data_clean$facing)] <- "Unknown"

cat("\nFinal missing values check:\n")
## 
## Final missing values check:
sapply(data_clean, function(x) sum(is.na(x)))
##   Price_per_sqft         location      Transaction       Furnishing 
##             7981                0                0                0 
##           facing        Ownership        Price_Lac Carpet_Area_sqft 
##                0                0                0            76325 
##  Super_Area_sqft    Current_Floor     Total_Floors     Bathroom_num 
##           101612             7329             7329                0 
##      Balcony_num      Parking_num        Area_sqft 
##                0                0               90

Note: Outliers are retained in this analysis as they may represent legitimate high-value luxury properties in the real estate market.

3.11 Final Dataset Preparation

The final cleaned dataset is prepared with selected variables organised for subsequent analysis.

# Check for outliers in Price_Lac using IQR method
Q1 <- quantile(data_clean$Price_Lac, 0.25, na.rm = TRUE)
Q3 <- quantile(data_clean$Price_Lac, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

cat("Price_Lac outlier bounds:\n")
## Price_Lac outlier bounds:
cat("Lower bound:", lower_bound, "Lac\n")
## Lower bound: -96.5 Lac
cat("Upper bound:", upper_bound, "Lac\n")
## Upper bound: 289.9 Lac
outliers_count <- sum(data_clean$Price_Lac < lower_bound | data_clean$Price_Lac > upper_bound, na.rm = TRUE)
cat("Number of outliers:", outliers_count, "\n")
## Number of outliers: 12283
# Visualize outliers
ggplot(data_clean, aes(y = Price_Lac)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(title = "Price Distribution (Lac) - Before Outlier Removal",
       y = "Price (Lac)") +
  theme_minimal()

# Option: Remove extreme outliers (keep for now, can be removed if needed)
# data_clean <- data_clean %>%
#   filter(Price_Lac >= lower_bound & Price_Lac <= upper_bound)

3.12 Dataset Export

The cleaned dataset is exported for future analytical procedures.

# Select and order final columns
data_final <- data_clean %>%
  select(
    location,
    Price_Lac,
    Price_per_sqft,
    Area_sqft,
    Carpet_Area_sqft,
    Super_Area_sqft,
    Current_Floor,
    Total_Floors,
    Bathroom_num,
    Balcony_num,
    Parking_num,
    Transaction,
    Furnishing,
    facing,
    Ownership
  )

# Display structure of final dataset
cat("Final Cleaned Dataset Structure:\n")
## Final Cleaned Dataset Structure:
str(data_final)
## 'data.frame':    177847 obs. of  15 variables:
##  $ location        : chr  "thane" "thane" "thane" "thane" ...
##  $ Price_Lac       : num  42 98 140 25 160 45 16.5 60 60 160 ...
##  $ Price_per_sqft  : int  6000 13799 17500 NA 18824 6618 2538 10435 10000 11150 ...
##  $ Area_sqft       : num  500 473 779 530 635 680 550 575 600 900 ...
##  $ Carpet_Area_sqft: num  500 473 779 530 635 NA 550 NA NA 900 ...
##  $ Super_Area_sqft : num  NA NA NA NA NA 680 NA 575 600 NA ...
##  $ Current_Floor   : num  10 3 10 1 20 2 4 0 0 3 ...
##  $ Total_Floors    : num  11 22 29 3 42 7 5 7 2 27 ...
##  $ Bathroom_num    : num  1 2 2 1 2 1 1 1 1 3 ...
##  $ Balcony_num     : num  2 2 2 1 2 1 2 2 2 1 ...
##  $ Parking_num     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Transaction     : chr  "Resale" "Resale" "Resale" "Resale" ...
##  $ Furnishing      : chr  "Unfurnished" "Semi-Furnished" "Unfurnished" "Unfurnished" ...
##  $ facing          : chr  "Unknown" "East" "East" "Unknown" ...
##  $ Ownership       : chr  "Unknown" "Freehold" "Freehold" "Unknown" ...
cat("\nFinal Dataset Dimensions:", dim(data_final), "\n")
## 
## Final Dataset Dimensions: 177847 15
# Display first few rows
head(data_final, 10)
##    location Price_Lac Price_per_sqft Area_sqft Carpet_Area_sqft Super_Area_sqft
## 1     thane      42.0           6000       500              500              NA
## 2     thane      98.0          13799       473              473              NA
## 3     thane     140.0          17500       779              779              NA
## 4     thane      25.0             NA       530              530              NA
## 5     thane     160.0          18824       635              635              NA
## 6     thane      45.0           6618       680               NA             680
## 7     thane      16.5           2538       550              550              NA
## 8     thane      60.0          10435       575               NA             575
## 9     thane      60.0          10000       600               NA             600
## 10    thane     160.0          11150       900              900              NA
##    Current_Floor Total_Floors Bathroom_num Balcony_num Parking_num Transaction
## 1             10           11            1           2           1      Resale
## 2              3           22            2           2           1      Resale
## 3             10           29            2           2           1      Resale
## 4              1            3            1           1           1      Resale
## 5             20           42            2           2           1      Resale
## 6              2            7            1           1           1      Resale
## 7              4            5            1           2           1      Resale
## 8              0            7            1           2           1      Resale
## 9              0            2            1           2           1      Resale
## 10             3           27            3           1           1      Resale
##        Furnishing  facing            Ownership
## 1     Unfurnished Unknown              Unknown
## 2  Semi-Furnished    East             Freehold
## 3     Unfurnished    East             Freehold
## 4     Unfurnished Unknown              Unknown
## 5     Unfurnished    West Co-operative Society
## 6     Unfurnished    East Co-operative Society
## 7     Unfurnished Unknown              Unknown
## 8       Furnished Unknown              Unknown
## 9       Furnished Unknown Co-operative Society
## 10    Unfurnished    East             Freehold
# Summary statistics
summary(data_final)
##    location           Price_Lac        Price_per_sqft      Area_sqft     
##  Length:177847      Min.   :     1.0   Min.   :      0   Min.   :     1  
##  Class :character   1st Qu.:    48.4   1st Qu.:   4297   1st Qu.:   833  
##  Mode  :character   Median :    78.0   Median :   6034   Median :  1150  
##                     Mean   :   119.8   Mean   :   7584   Mean   :  1269  
##                     3rd Qu.:   145.0   3rd Qu.:   9450   3rd Qu.:  1570  
##                     Max.   :140030.0   Max.   :6700000   Max.   :709222  
##                                        NA's   :7981      NA's   :90      
##  Carpet_Area_sqft   Super_Area_sqft  Current_Floor     Total_Floors    
##  Min.   :     1.0   Min.   :     1   Min.   :  0.00   Min.   :  1.000  
##  1st Qu.:   770.2   1st Qu.:   955   1st Qu.:  2.00   1st Qu.:  4.000  
##  Median :  1050.0   Median :  1256   Median :  3.00   Median :  5.000  
##  Mean   :  1226.4   Mean   :  1325   Mean   :  4.51   Mean   :  8.801  
##  3rd Qu.:  1500.0   3rd Qu.:  1650   3rd Qu.:  6.00   3rd Qu.: 11.000  
##  Max.   :709222.0   Max.   :530040   Max.   :200.00   Max.   :200.000  
##  NA's   :76325      NA's   :101612   NA's   :7329     NA's   :7329     
##   Bathroom_num     Balcony_num     Parking_num      Transaction       
##  Min.   : 1.000   Min.   : 1.00   Min.   :  1.000   Length:177847     
##  1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.:  1.000   Class :character  
##  Median : 2.000   Median : 2.00   Median :  1.000   Mode  :character  
##  Mean   : 2.448   Mean   : 2.01   Mean   :  2.385                     
##  3rd Qu.: 3.000   3rd Qu.: 2.00   3rd Qu.:  1.000                     
##  Max.   :11.000   Max.   :11.00   Max.   :999.000                     
##                                                                       
##   Furnishing           facing           Ownership        
##  Length:177847      Length:177847      Length:177847     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

3.13 Save Cleaned Dataset

# Save cleaned dataset
write.csv(data_final, "house_price_cleaned.csv", row.names = FALSE)
cat("Cleaned dataset saved as 'house_price_cleaned.csv'\n")
## Cleaned dataset saved as 'house_price_cleaned.csv'

3.14 Data Cleaning Summary

3.14.1 Cleaning Procedures Applied

Step Procedure Description
1 Variable removal Removed Index, Title, Description, Society, Plot.Area, Dimensions
2 Price conversion Converted “Lac”/“Cr” format to numeric values in Lakhs
3 Area conversion Extracted numeric values from Carpet.Area and Super.Area
4 Floor extraction Derived Current_Floor and Total_Floors from Floor variable
5 Numeric conversion Converted Bathroom and Balcony to numeric format
6 Parking extraction Extracted parking count from Car.Parking
7 Standardisation Normalised categorical variables
8 Missing value treatment Applied median imputation and category assignment
9 Area combination Created combined Area_sqft variable

3.14.2 Final Dataset Variables

Variable Type Description
location Categorical City name (5 categories)
Price_Lac Numeric Total price in Lakhs (target variable)
Price_per_sqft Numeric Price per square foot
Area_sqft Numeric Combined area in sqft
Carpet_Area_sqft Numeric Carpet area in sqft
Super_Area_sqft Numeric Super built-up area in sqft
Current_Floor Numeric Current floor number
Total_Floors Numeric Total floors in building
Bathroom_num Numeric Number of bathrooms
Balcony_num Numeric Number of balconies
Parking_num Numeric Number of parking spaces
Transaction Categorical Resale/New Property
Furnishing Categorical Furnished/Semi-Furnished/Unfurnished
facing Categorical Direction property faces
Ownership Categorical Freehold/Leasehold/etc.

4. Exploratory Data Analysis (EDA)

🎯 Core Objective: Explore and understand the dataset through visualization and statistics to uncover patterns, detect anomalies, and generate insights before formal modeling.

Function Purpose
boxplot() Detect outliers and visualize distributions
quantile() Calculate percentile thresholds
ggplot() + geom_histogram() Visualize continuous variable distributions
geom_density() Overlay density curves
log1p() Log transformation for skewed data
cor.test() Correlation tests (Pearson/Spearman)
lm() Linear regression for R² analysis
kruskal.test() Non-parametric group comparison
geom_boxplot() Compare distributions across categories
geom_smooth() Add trend lines to scatter plots

It aims to explore and understand the dataset through summary statistics and visualization to uncover patterns, detect anomalies, and generate insights before formal modeling.

4.1 Data Integrity and Anomaly Inspection

4.1.1 Outlier Detection and Data Validity Check

This section investigates the distributions of key variables Price_Lac, Area_sqft, Parking_num to detect potential outliers and anomalies.

library(dplyr)
house_df <- read.csv("house_price_cleaned.csv",stringsAsFactors = FALSE)

# a. Distribution visualization
op <- par(mfrow = c(1, 3))
boxplot(house_df$Price_Lac, main = "Price_Lac",
ylab = "Price (Lac)", col = "salmon", border = "brown")
boxplot(house_df$Area_sqft, main = "Area_sqft",
ylab = "Area (sqft)", col = "lightblue", border = "blue")
boxplot(house_df$Parking_num, main = "Parking_num",
ylab = "Parking count", col = "lightgreen", border = "darkgreen")

par(op)

# b. Parking Anomalies detection (Top 0.5%)
park_thr <- quantile(house_df$Parking_num, 0.995, na.rm = TRUE)
anomalous_parking <- subset(house_df, !is.na(Parking_num) & Parking_num > park_thr)

cat("Parking threshold (99.5th percentile):", park_thr, "\n")
Parking threshold (99.5th percentile): 34 
cat("Rows with Parking_num above threshold:", nrow(anomalous_parking), "\n")
Rows with Parking_num above threshold: 589 
anomalous_parking %>%
  dplyr::select(location, Parking_num, Price_Lac, Area_sqft) %>%
  head() %>%
  knitr::kable(
    caption = "Anomalous Parking Records",
    digits = 0,
    format.args = list(big.mark = ","),
    row.names = FALSE)
Anomalous Parking Records
location Parking_num Price_Lac Area_sqft
thane 66 90 600
thane 701 260 1,200
thane 35 24 654
thane 323 245 1,250
thane 103 81 600
thane 203 80 600
# c. Area outliers detection (Top 0.1%)
area_thr <- quantile(house_df$Area_sqft, 0.999, na.rm = TRUE)
extreme_area <- subset(house_df, !is.na(Area_sqft) & Area_sqft > area_thr)

cat("Area threshold (99.9th percentile):", area_thr, "\n")
Area threshold (99.9th percentile): 6481.1 
cat("Number of extreme area outliers:", nrow(extreme_area), "\n")
Number of extreme area outliers: 178 
extreme_area %>%
  dplyr::arrange(desc(Area_sqft)) %>% 
  dplyr::select(location, Area_sqft, Price_Lac, Parking_num) %>%
  head(10) %>%
  knitr::kable(
    caption = "Top 10 Listings by Area (Extreme High)",
    digits = 0,
    format.args = list(big.mark = ","))
Top 10 Listings by Area (Extreme High)
location Area_sqft Price_Lac Parking_num
guwahati 709,222 60 1
visakhapatnam 530,040 105 1
bhiwadi 495,970 19 1
agra 282,004 20 1
gurgaon 194,936 260 2
agra 113,134 56 1
jaipur 107,806 68 1
thrissur 81,845 45 1
siliguri 81,675 46 1
siliguri 71,775 40 1
# d. Price outliers detection (Top 0.1%)
price_thr <- quantile(house_df$Price_Lac, 0.999, na.rm = TRUE)
extreme_prices <- subset(house_df, !is.na(Price_Lac) & Price_Lac > price_thr)

cat("Price threshold (99.9th percentile):", price_thr, "\n")
Price threshold (99.9th percentile): 1250 
cat("Number of extreme price outliers:", nrow(extreme_prices), "\n")
Number of extreme price outliers: 174 
extreme_prices %>%
  dplyr::select(location, Price_Lac, Area_sqft, Parking_num) %>%
  head() %>%
  knitr::kable(
    caption = "Extreme Price Listings",
    digits = 0,
    format.args = list(big.mark = ","),
    row.names = FALSE)
Extreme Price Listings
location Price_Lac Area_sqft Parking_num
mumbai 1,475 2,800 2
mumbai 2,200 2,300 2
mumbai 3,600 10,000 4
mumbai 3,680 15,000 6
mumbai 3,600 12,000 6
mumbai 4,000 10,000 3

For Parking_num, the 99.5th-percentile cutoff is 34 and 589 rows are above it. Values like 66, 103, 203, 323, and 701 look unrealistic for a place of residence, so they are more likely data-entry issues, placeholders, or a definition mismatch.

For Area_sqft, the 99.9th-percentile cutoff is 6481.1 and 178 rows are above it. The maximum area reaches 709,222 sqft, which is not plausible for typical residential listings and strongly suggests a unit or data-entry problem.

For Price_Lac, the 99.9th-percentile cutoff is 1250 and 174 extreme-price rows are flagged. These rows often have large Area_sqft values, such as 10,000 to 15,000. They are likely genuine luxury listings.

4.1.2 Missing and “Unknown” Values Analysis

This section evaluates data completeness by examining missing values across variables and the prevalence of the “Unknown” placeholder in categorical fields.

library(tidyverse)

# a. Missing rate by variable
missing_rate <- sapply(house_df, function(x) mean(is.na(x)))
missing_df <- data.frame(
variable = names(missing_rate),
missing_pct = as.numeric(missing_rate))
missing_df <- missing_df[missing_df$missing_pct > 0,]
missing_df <- missing_df[order(missing_df$missing_pct, decreasing = TRUE),]

ggplot(missing_df, aes(x = reorder(variable, missing_pct), y = missing_pct)) +
  geom_col(fill = "tomato") +
  geom_text(aes(label = scales::percent(missing_pct, accuracy = 0.1)),
    hjust = -0.05, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Missing value rate by variable",
    x = NULL,
    y = "Missing rate (NA)") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

# b. "Unknown" proportion in categorical fields
unknown_cols <- c("facing", "Ownership", "Furnishing", "Transaction")

unknown_df <- data.frame(
variable = unknown_cols,
unknown_pct = sapply(unknown_cols, function(col) {
mean(tolower(house_df[[col]]) == "unknown", na.rm = TRUE)}))

unknown_df <- unknown_df[order(unknown_df$unknown_pct, decreasing = TRUE),]

ggplot(unknown_df, aes(x = reorder(variable, unknown_pct), y = unknown_pct)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = scales::percent(unknown_pct, accuracy = 0.1)), hjust = -0.05, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "\"Unknown\" proportion by categorical variable",
    x = NULL,
    y = "Proportion of \"Unknown\"") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

Super_Area_sqft (57.13%) and Carpet_Area_sqft (42.92%) show significant missing values. They are the main sources of missing data in the dataset. The missing rate for Price_per_sqft is relatively low at 4.49%. These data gaps remain in area-related fields. This indicates lower data quality for these variables. Other fields are largely complete.

Among categorical variables, Facing and Ownership have high ‘Unknown’ rates, standing at 36.96% and 34.82% respectively. This limits the effective information despite the absence of missing values. ‘Unknown’ values are rare in Furnishing (1.16%) and Transaction (0.04%). The data quality concern is concentrated in only a few categorical fields.

4.2 Univariate Analysis: Distributions

4.2.1 Target Variable Analysis: Price

This section examines the distribution of the target variable Price_Lac. The goal is to assess its skewness and tail behavior.

library(scales)
library(ggplot2)
library(patchwork)
library(kableExtra)

# a. Price distribution
summary(house_df$Price_Lac) %>%
  broom::tidy() %>%
  knitr::kable(
    caption = "Summary Statistics: Price (Lac)", 
    digits = 2,
    align   = "c",
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(full_width = TRUE)
Summary Statistics: Price (Lac)
minimum q1 median mean q3 maximum
1 48.4 78 119.81 145 140,030
house_df$Price_Lac <- as.numeric(as.character(house_df$Price_Lac))

house_df <- house_df %>%
  mutate(log_price = log1p(Price_Lac))

median_raw <- median(house_df$Price_Lac, na.rm = TRUE)
median_log <- median(house_df$log_price, na.rm = TRUE)

# Raw scale
p99 <- quantile(house_df$Price_Lac, 0.99, na.rm = TRUE)

p1 <- ggplot(house_df, aes(x = Price_Lac)) +
 
  geom_histogram(aes(y = after_stat(density)), bins = 50, 
                 fill = "black", color = "white", alpha = 0.8, na.rm = TRUE) +

  geom_density(color = "darkblue", linewidth = 1, na.rm = TRUE) +

  geom_vline(xintercept = median_raw, linetype = "dashed", color = "tomato", linewidth = 0.8) +
  annotate("text", x = median_raw, y = 0, 
           label = paste("Median:", format(round(median_raw, 1), big.mark=",")), 
           vjust = -1, hjust = -0.1, color = "tomato", fontface = "bold", size = 3.5) +
  coord_cartesian(xlim = c(0, p99)) +
  labs(title = "Original Scale (Zoomed top 99%)", 
       x = "Price (Lac)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

# Log transformed scale
p2 <- ggplot(house_df, aes(x = log_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, 
                 fill = "lightgreen", color = "white", alpha = 0.8, na.rm = TRUE) +
  geom_density(color = "steelblue", linewidth = 1, na.rm = TRUE) +
  geom_vline(xintercept = median_log, linetype = "dashed", color = "tomato", linewidth = 0.8) +
  annotate("text", x = median_log, y = 0, 
           label = paste("Median:", round(median_log, 2)), 
           vjust = -1, hjust = -0.1, color = "tomato", fontface = "bold", size = 3.5) +
  labs(title = "Log Transformed Scale", 
       x = "log1p(Price)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

p1 + p2

Given the extremely long right tail on the raw scale (Price_Lac max = 140,030), applying log1p(Price_Lac) compresses extreme values. This results in a much more symmetric distribution. The log view effectively mitigates the skewness caused by very expensive listings.

4.2.2 Key Numerical Features

This section analyzes Area_sqft and Price_per_sqft. The focus is on identifying skewness, heavy tails, and potential anomalies to reveal the data’s spread and extreme behaviors.

# a. Area_sqft distribution
summary(house_df$Area_sqft) %>%
  broom::tidy() %>%
  knitr::kable(
    caption = "Summary Statistics: Area_sqft",
    digits = 0,
    align   = "c",
    format.args = list(big.mark = ","))%>%
  kable_styling(full_width = TRUE)
Summary Statistics: Area_sqft
minimum q1 median mean q3 maximum na
1 833 1,150 1,269 1,570 709,222 90
quantile(house_df$Area_sqft, probs = c(0.5, 0.95, 0.99, 0.999), na.rm = TRUE) %>%
  t() %>%
  as.data.frame() %>% 
  knitr::kable(
    caption = "Quantiles: Area_sqft",
    digits = 0,
    align   = "c",
    format.args = list(big.mark = ","))%>%
  kable_styling(full_width = TRUE)
Quantiles: Area_sqft
50% 95% 99% 99.9%
1,150 2,500 3,784 6,481
area_p99 <- quantile(house_df$Area_sqft, 0.99, na.rm = TRUE)
median_area <- median(house_df$Area_sqft, na.rm = TRUE)

# Distribution
p1 <- ggplot(house_df, aes(x = Area_sqft)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60,
                 fill = "black", color = "white", alpha = 0.7, na.rm = TRUE) +
  geom_density(color = "navy", linewidth = 1, na.rm = TRUE) +
  geom_vline(xintercept = median_area, linetype = "dashed", color = "red") +
  coord_cartesian(xlim = c(0, area_p99)) +
  labs(title = "Distribution",
       x = "Area (sqft)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

# Density & Spread
p2 <- ggplot(house_df, aes(x = "", y = Area_sqft)) +
  geom_violin(fill = "lightblue", color = "steelblue", alpha = 0.5, trim = TRUE) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +

  coord_cartesian(ylim = c(0, area_p99)) +
  labs(title = "Density & Spread",
       x = NULL, y = "Area (sqft)") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    axis.text.x = element_blank())

p1 + p2

Area_sqft is right-skewed, with a median around 1,150 and a very long upper tail (99.9th percentile ≈ 6,481, max ≈ 709,222), which suggests a mix of genuine large properties and potential unit or data-entry anomalies.

# b. Price_per_sqft distribution
summary(house_df$Price_per_sqft) %>%
  broom::tidy() %>%
  knitr::kable(
    caption = "Summary Statistics: Price_per_sqft",
    digits = 0,
    align   = "c",
    format.args = list(big.mark = ","))%>%
  kable_styling(full_width = TRUE)
Summary Statistics: Price_per_sqft
minimum q1 median mean q3 maximum na
0 4,297 6,034 7,584 9,450 6,700,000 7,981
quantile(house_df$Price_per_sqft, probs = c(0.5, 0.95, 0.99, 0.999), na.rm = TRUE) %>%
  t() %>%
  knitr::kable(
    caption = "Quantiles: Price_per_sqft",
    digits = 0,
    align   = "c",
    format.args = list(big.mark = ","))%>%
  kable_styling(full_width = TRUE)
Quantiles: Price_per_sqft
50% 95% 99% 99.9%
6,034 16,111 23,810 46,429
pps_p99 <- quantile(house_df$Price_per_sqft, 0.99, na.rm = TRUE)
median_pps <- median(house_df$Price_per_sqft, na.rm = TRUE)

# Distribution
p1 <- ggplot(house_df, aes(x = Price_per_sqft)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60, fill = "seagreen", color = "white", alpha = 0.7, na.rm = TRUE) +
  geom_density(color = "darkgreen", linewidth = 1, na.rm = TRUE) +
  geom_vline(xintercept = median_pps, linetype = "dashed", color = "darkorange") +
  annotate("text", x = median_pps, y = 0, label = paste("Median:", comma(median_pps)), vjust = -1, hjust = -0.1, color = "darkorange", fontface = "bold", size = 3) +
  coord_cartesian(xlim = c(0, pps_p99)) +
  labs(title = "Distribution",
       x = "Price per Sqft", y = "Density") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = scales::label_number(accuracy = 0.0001)) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

# Density & Spread
p2 <- ggplot(house_df, aes(x = "", y = Price_per_sqft)) +
  geom_violin(fill = "lightgreen", color = "darkgreen", alpha = 0.6, trim = TRUE, na.rm = TRUE) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA, na.rm = TRUE) +
  coord_cartesian(ylim = c(0, pps_p99)) +
  labs(title = "Density & Spread",
       x = NULL, y = "Price per Sqft") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    axis.text.x = element_blank())

p1 + p2

Price_per_sqft shows heavy tails with extreme upper values. The median is 6,034, while the maximum reaches 6,700,000. Additionally, there is non-trivial missingness with 7,981 missing entries. These factors indicate a highly skewed distribution with significant data gaps.

4.2.3 Categorical Features

This section analyzes frequency distributions of key categorical variables includinglocation, Transaction, Furnishing, facing and Ownership. The goal is to identify class imbalance and high-cardinality issues.

# a. Frequency overview
cat_vars <- c("location", "Transaction", "Furnishing", "facing", "Ownership")

cat_summary <- purrr::map_dfr(cat_vars, function(v) {
  x <- house_df[[v]]
  n_total <- length(x)
  n_unique <- dplyr::n_distinct(x)
  top_counts <- sort(table(x), decreasing = TRUE)
  top1_pct <- as.numeric(top_counts[1]) / n_total
  top3_pct <- sum(top_counts[1:min(3, length(top_counts))]) / n_total
  unknown_pct <- mean(tolower(as.character(x)) == "unknown", na.rm = TRUE)
  
  tibble::tibble(
    variable = v,
    n_unique = n_unique,
    top1_pct = top1_pct,
    top3_pct = top3_pct,
    unknown_pct = unknown_pct)})

cat_summary %>%
  mutate(across(ends_with("_pct"), ~ scales::percent(.x, accuracy = 0.1))) %>%
  knitr::kable(
    caption = "Summary of Categorical Variables",
    align = "lcccc")
Summary of Categorical Variables
variable n_unique top1_pct top3_pct unknown_pct
location 81 14.0% 39.3% 0.0%
Transaction 5 76.3% 100.0% 0.0%
Furnishing 4 46.6% 98.8% 1.2%
facing 9 37.0% 79.5% 37.0%
Ownership 5 59.8% 97.5% 34.8%

location shows an uneven distribution. While there are 81 unique cities, the top three dominate with 39.3% coverage.

Transaction and Ownership show strong class concentration. A single category dominates each variable (76.3% and 59.8% respectively).

facing is notable for its high “Unknown” rate of 37.0%. This makes the placeholder the most frequent category. Ownership also contains significant placeholders (34.8%).

# b. Location (Top-N) distribution
house_df_base <- house_df %>%
  mutate(
    Price_Lac = as.numeric(Price_Lac),
    log_price = log1p(Price_Lac)
  ) %>%
  filter(!is.na(Price_Lac), Price_Lac >= 0)

# Identify Top 20 Locations
top_n <- 20

loc_top20_levels <- house_df_base %>%
  count(location, sort = TRUE) %>%
  slice_head(n = top_n) %>%
  pull(location)

# Calculate Distribution
loc_counts <- house_df_base %>%
  mutate(location_top20 = if_else(location %in% loc_top20_levels, location, "Other")) %>%
  count(location_top20, sort = TRUE) %>%
  mutate(pct = n / sum(n)) 

loc_plot_df <- loc_counts %>%
  mutate(location_top20 = reorder(location_top20, pct))

# Visualization
ggplot(loc_plot_df, aes(x = location_top20, y = pct)) +
  geom_col(fill = "steelblue") +
  geom_text(
    aes(label = percent(pct, accuracy = 0.1)),
    hjust = -0.05, 
    size = 3) +
  coord_flip() +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = paste0("Top ", top_n, " locations (others grouped)"),
    x = NULL,
    y = "Share of listings") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))

The top five locations make up roughly 56.9% of the total, showing that the data is heavily concentrated rather than evenly distributed.

# c. Low-cardinality categorical variables
plot_cat_pct <- function(df, col) {
tmp <- df %>%
count(.data[[col]], sort = TRUE) %>%
mutate(pct = n / sum(n)) %>%
rename(level = 1)

ggplot(tmp, aes(x = reorder(level, pct), y = pct)) +
  geom_col(fill = "tomato") +
  geom_text(aes(label = percent(pct, accuracy = 0.1)), hjust = -0.05, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 1),
expand = expansion(mult = c(0, 0.3))) +
  labs(
  title = paste0(col, " distribution"),
  x = NULL,
  y = "Share") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5))}

p1 <- plot_cat_pct(house_df, "Transaction")
p2 <- plot_cat_pct(house_df, "Furnishing")
p3 <- plot_cat_pct(house_df, "facing")
p4 <- plot_cat_pct(house_df, "Ownership")

(p1 + p2) / (p3 + p4)

Among low-cardinality variables, Transaction is highly imbalanced. “Resale” dominates with 76.3%, and the top three categories cover the entire dataset. Furnishing is similarly concentrated but contains rare “Unknown” entries accounting for 1.2%. facing and Ownership show high “Unknown” rates of 37.0% and 34.8% respectively. While these variables appear complete, they actually contain substantial placeholder information.

4.3 Bivariate Analysis: Determinants of Price

4.3.2 Price vs Discrete Features

This section investigates Current_Floor, Total_Floors, Bathroom_num, Balcony_num, and Parking_num to see if these discrete features lead to systematic price variations. The goal is to identify potential premiums for higher counts and understand the price consistency within each category.

# a. Grouping strategy
house_df <- house_df %>%
  mutate(
    Price_Lac = as.numeric(Price_Lac),
    Area_sqft = as.numeric(Area_sqft),
    Current_Floor = as.numeric(Current_Floor),
    Total_Floors = as.numeric(Total_Floors),
    Bathroom_num = as.numeric(Bathroom_num),
    Balcony_num = as.numeric(Balcony_num),
    Parking_num = as.numeric(Parking_num)
  ) %>%
  filter(!is.na(Price_Lac), !is.na(Area_sqft), Price_Lac >= 0, Area_sqft > 0) %>%
  mutate(log_price = log1p(Price_Lac)) %>%
  mutate(
    Bathroom_grp = case_when(
      is.na(Bathroom_num) ~ NA_character_,
      Bathroom_num >= 5 ~ ">=5",
      TRUE ~ as.character(as.integer(Bathroom_num))),
    Balcony_grp = case_when(
      is.na(Balcony_num) ~ NA_character_,
      Balcony_num >= 4 ~ ">=4",
      TRUE ~ as.character(as.integer(Balcony_num))),
    Parking_capped = case_when(
      is.na(Parking_num) ~ NA_real_,
      Parking_num > 10 ~ 10,
      TRUE ~ Parking_num),
    Parking_grp = case_when(
      is.na(Parking_capped) ~ NA_character_,
      Parking_capped >= 3 ~ ">=3",
      TRUE ~ as.character(as.integer(Parking_capped))),
    CurrentFloor_grp = case_when(
      is.na(Current_Floor) ~ NA_character_,
      Current_Floor == 0 ~ "0",
      Current_Floor <= 2 ~ "1-2",
      Current_Floor <= 5 ~ "3-5",
      Current_Floor <= 10 ~ "6-10",
      Current_Floor <= 20 ~ "11-20",
      TRUE ~ ">20"),
    TotalFloors_grp = case_when(
      is.na(Total_Floors) ~ NA_character_,
      Total_Floors <= 5 ~ "<=5",
      Total_Floors <= 10 ~ "6-10",
      Total_Floors <= 20 ~ "11-20",
      TRUE ~ ">20"),
    Floor_Ratio = if_else(!is.na(Current_Floor) & !is.na(Total_Floors) & Total_Floors > 0, 
                          pmin(pmax(Current_Floor / Total_Floors, 0), 1), NA_real_),
    FloorPos_grp = case_when(
      is.na(Floor_Ratio) ~ NA_character_,
      Floor_Ratio <= 0.30 ~ "Bottom",
      Floor_Ratio <= 0.75 ~ "Middle",
      TRUE ~ "Top")
  ) %>%
  mutate(
    Bathroom_grp = factor(Bathroom_grp, levels = c("1", "2", "3", "4", ">=5")),
    Balcony_grp  = factor(Balcony_grp,  levels = c("1", "2", "3", ">=4")),
    Parking_grp  = factor(Parking_grp,  levels = c("1", "2", ">=3")),
    CurrentFloor_grp = factor(CurrentFloor_grp, levels = c("0","1-2","3-5","6-10","11-20",">20")),
    TotalFloors_grp  = factor(TotalFloors_grp,  levels = c("<=5","6-10","11-20",">20")),
    FloorPos_grp = factor(FloorPos_grp, levels = c("Bottom", "Middle", "Top")))

plot_group_dist <- function(col_name, plot_title, x_label) {
  house_df %>%
    filter(!is.na(.data[[col_name]])) %>%
    count(.data[[col_name]]) %>%
    ggplot(aes(x = .data[[col_name]], y = n)) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = comma(n)), vjust = -0.5, size = 3) +
    labs(title = plot_title, x = x_label, y = "Count") +
    scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.2))) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
      axis.text.x = element_text(angle = 0))}

p1 <- plot_group_dist("Bathroom_grp", "Bathrooms", "Number of Bathrooms")
p2 <- plot_group_dist("Balcony_grp", "Balconies", "Number of Balconies")
p3 <- plot_group_dist("Parking_grp", "Parking Spaces", "Number of Parking Spots")
p4 <- plot_group_dist("FloorPos_grp", "Floor Position", "Relative Floor Level")

(p1 + p2) / (p3 + p4)

Key price, area, floor, and amenity fields were coerced to numeric types, and observations containing missing or invalid values were excluded.Bathrooms were grouped into categories of 1–4 and \(\ge 5\), and balconies into 1–3 and \(\ge 4\). Parking spaces were capped and categorized into 1, 2, and \(\ge 3\).Current and total floors were grouped into 0, 1–2, 3–5, 6–10, 11–20, and >20, while the ratio of current to total floors was classified into Bottom, Middle, and Top categories. Finally, all grouped variables were cast as ordered factors.

# b. Price vs Bathrooms
price_p99 <- quantile(house_df$Price_Lac, 0.99, na.rm = TRUE)

# Raw Scale
p_raw <- ggplot(house_df %>% filter(!is.na(Bathroom_grp)),
                aes(x = Bathroom_grp, y = Price_Lac, fill = Bathroom_grp)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "white") +
  coord_cartesian(ylim = c(0, price_p99)) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Price by Bathrooms (Raw Scale)",
       x = "Number of Bathrooms",
       y = "Price (Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Log Scale
p_log <- ggplot(house_df %>% filter(!is.na(Bathroom_grp)), 
                aes(x = Bathroom_grp, y = log_price, fill = Bathroom_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Log Price by Bathrooms",
       x = "Number of Bathrooms",
       y = "log1p(Price)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

p_raw + p_log

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Bathroom_grp, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Bathroom Group",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Bathroom Group
\(\chi^2\) DF \(P\)
86,635.5 4 0
# Summary Table
house_df %>%
  filter(!is.na(Bathroom_grp)) %>%
  group_by(Bathroom_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  knitr::kable(
    caption = "Price Distribution by Bathroom Group",
    col.names = c("Bathroom Group", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Bathroom Group
Bathroom Group Count Median Price (Lac) Median Log Price
1 15,494 25 3.258
2 89,709 60 4.111
3 54,166 121 4.804
4 14,823 235 5.464
>=5 3,565 500 6.217

The raw-scale boxplot (truncated at P99) reveals a distinct stepwise price increase, with medians rising from 25 Lac (1 bath) to 500 Lac (≥5 baths). This positive trend remains robust on the log scale, indicating it is not driven by outliers.The Kruskal–Wallis test is highly significant (χ² ≈ 86635.50,P ≪ 0.001), confirming statistically meaningful distribution differences and supporting bathrooms as an important discrete determinant of price.

# c. Price vs Balconies
price_p99 <- quantile(house_df$Price_Lac, 0.99, na.rm = TRUE)

# Raw Scale
p_raw <- ggplot(house_df %>% filter(!is.na(Balcony_grp)),
                    aes(x = Balcony_grp, y = Price_Lac, fill = Balcony_grp)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "white") +
  coord_cartesian(ylim = c(0, price_p99)) +
  scale_fill_brewer(palette = "Greens") +
  labs(title = "Price by Balconies (Raw Scale)",
       x = "Number of Balconies",
       y = "Price (Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Log Scale
p_log <- ggplot(house_df %>% filter(!is.na(Balcony_grp)),
                    aes(x = Balcony_grp, y = log_price, fill = Balcony_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Greens") +
  labs(title = "Log Price by Balconies",
       x = "Number of Balconies",
       y = "log1p(Price)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

p_raw + p_log

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Balcony_grp, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Balcony Group",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Balcony Group
\(\chi^2\) DF \(P\)
23,750.15 3 0
# Summary Table
house_df %>%
  filter(!is.na(Balcony_grp)) %>%
  group_by(Balcony_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  knitr::kable(
    caption = "Price Distribution by Balcony Group",
    col.names = c("Balcony Group", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Balcony Group
Balcony Group Count Median Price (Lac) Median Log Price
1 45,125 55 4.025
2 97,410 78 4.369
3 24,953 135 4.913
>=4 10,269 210 5.352

The raw-scale boxplot (truncated at P99) shows price distributions shifting upward with more balconies, with medians rising from 55 Lac (1 balcony) to 210 Lac (≥4). This positive association remains robust on the log scale, unaffected by outliers. A significant Kruskal–Wallis test (χ² ≈ 23750.15, P ≪ 0.001) confirms statistically significant group differences, supporting balcony count as an informative indicator of price levels.

# d. Price vs Parking
price_p99 <- quantile(house_df$Price_Lac, 0.99, na.rm = TRUE)

# Raw Scale
p_raw <- ggplot(house_df %>% filter(!is.na(Parking_grp)),
                     aes(x = Parking_grp, y = Price_Lac, fill = Parking_grp)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "white") +
  coord_cartesian(ylim = c(0, price_p99)) +
  scale_fill_brewer(palette = "Purples") +
  labs(title = "Price by Parking Spaces (Raw Scale)",
       x = "Number of Parking Spots",
       y = "Price (lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Log Scale
p_log <- ggplot(house_df %>% filter(!is.na(Parking_grp)),
                     aes(x = Parking_grp, y = log_price, fill = Parking_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Purples") +
  labs(title = "Log Price by Parking Spaces",
       x = "Number of Parking Spots",
       y = "log1p(Price)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

p_raw + p_log

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Parking_grp, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Parking Group",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Parking Group
\(\chi^2\) DF \(P\)
23,466.89 2 0
# Summary Table
house_df %>%
  filter(!is.na(Parking_grp)) %>%
  group_by(Parking_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  knitr::kable(
    caption = "Price Distribution by Parking Group",
    col.names = c("Parking Group", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Parking Group
Parking Group Count Median Price (Lac) Median Log Price
1 158,178 70 4.263
2 16,546 270 5.602
>=3 3,033 58 4.078

The raw-scale boxplot (truncated at P99) reveals pronounced but non-monotonic price differences where medians peak at 2 spots (270 Lac), significantly exceeding both the 1 spot (70 Lac) and \(\ge 3\) spots (58 Lac) groups. This pattern remains robust on the log scale, although the lower median for the ≥3 group suggests potential sample heterogeneity requiring cautious interpretation. A significant Kruskal–Wallis test (\(\chi^2 \approx 23466.89\), \(P \ll 0.001\)) confirms that parking availability is a statistically significant indicator of price distribution.

# e. Price vs Floors
# Current Floor
p_curr <- ggplot(house_df %>% filter(!is.na(CurrentFloor_grp)),
                 aes(x = CurrentFloor_grp, y = log_price, fill = CurrentFloor_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Oranges") +
  labs(title = "Log Price by Current Floor",
       x = "Current Floor Level",
       y = "log1p(Price)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Total Floors
p_tot <- ggplot(house_df %>% filter(!is.na(TotalFloors_grp)),
                aes(x = TotalFloors_grp, y = log_price, fill = TotalFloors_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Oranges") +
  labs(title = "Log Price by Total Floors",
       x = "Total Floors Category",
       y = NULL) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Floor Position
p_pos <- ggplot(house_df %>% filter(!is.na(FloorPos_grp)),
                aes(x = FloorPos_grp, y = log_price, fill = FloorPos_grp)) +
  geom_boxplot(outlier.alpha = 0.3, alpha = 0.7) +
  scale_fill_brewer(palette = "Oranges") +
  labs(title = "Log Price by Position",
       x = "Relative Position",
       y = NULL) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5), 
    legend.position = "none")

p_curr + p_tot + p_pos

# Kruskal-Wallis Test (Combined)
kw_curr <- kruskal.test(Price_Lac ~ CurrentFloor_grp, data = house_df)
kw_tot  <- kruskal.test(Price_Lac ~ TotalFloors_grp, data = house_df)
kw_pos  <- kruskal.test(Price_Lac ~ FloorPos_grp, data = house_df)

dplyr::bind_rows(
  broom::tidy(kw_curr) %>% mutate(Variable = "Current Floor"),
  broom::tidy(kw_tot)  %>% mutate(Variable = "Total Floors"),
  broom::tidy(kw_pos)  %>% mutate(Variable = "Floor Position")) %>%
  dplyr::select(Variable, statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Floor Variables",
    col.names = c("Variable", "$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Floor Variables
Variable \(\chi^2\) DF \(P\)
Current Floor 17,372.029 5 0
Total Floors 17,642.590 3 0
Floor Position 556.269 2 0
# Current Floor Summary
house_df %>%
  filter(!is.na(CurrentFloor_grp)) %>%
  group_by(CurrentFloor_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  knitr::kable(
    caption = "Price Distribution by Current Floor Group",
    col.names = c("Current Floor Grp", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Current Floor Group
Current Floor Grp Count Median Price (Lac) Median Log Price
0 10,577 71 4.277
1-2 61,950 65 4.190
3-5 53,971 69 4.248
6-10 26,624 115 4.754
11-20 15,633 120 4.796
>20 1,704 310 5.740
# Total Floors Summary
house_df %>%
  filter(!is.na(TotalFloors_grp)) %>%
  group_by(TotalFloors_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  knitr::kable(
    caption = "Price Distribution by Total Floors Group",
    col.names = c("Total Floors Grp", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Total Floors Group
Total Floors Grp Count Median Price (Lac) Median Log Price
<=5 85,888 61 4.127
6-10 38,640 85 4.454
11-20 33,472 90 4.511
>20 12,459 170 5.142
# Floor Position Summary
house_df %>%
  filter(!is.na(FloorPos_grp)) %>%
  group_by(FloorPos_grp) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  knitr::kable(
    caption = "Price Distribution by Floor Position",
    col.names = c("Floor Position Grp", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Distribution by Floor Position
Floor Position Grp Count Median Price (Lac) Median Log Price
Bottom 45,000 75 4.331
Middle 86,355 80 4.394
Top 39,104 72 4.290

Log-scale boxplots reveal clear price stratification where median prices rise from 65–71 Lac for low floors to 115–120 Lac for mid-high floors, jumping to 310 Lac for the >20 group. Total building height likewise exhibits a consistent high-rise premium with medians increasing from 61 Lac (≤5) to 170 Lac (>20). Highly significant Kruskal–Wallis tests for Current Floor (χ² ≈ 17372.03) and Total Floors (χ² ≈ 17642.59) confirm the statistical impact of height on price. In contrast, Floor Position shows significant but narrower median gaps (72–80 Lac, χ² ≈ 556.27), suggesting the premium is driven more by absolute height and scarcity than by relative position.

4.3.3 Price vs Categorical Variables

This section investigates systematic price differences across categorical variables. Using log1p(Price_Lac) distributions, Kruskal–Wallis tests, and median comparisons, price levels associated with market segmentation factors are identified. location_top20 serves as the primary baseline, while Transaction and Furnishing are utilized to assess specific premiums. Due to high “Unknown” rates in facing (36.96%) and Ownership (34.82%), results for these variables are reported as supplementary, interpreting “Unknown” as missing data rather than a valid business category.

# a. Location effect: Top 20 + Other
df_loc <- house_df_base %>%
  mutate(location_top20 = if_else(location %in% loc_top20_levels, location, "Other"))

loc_order <- df_loc %>%
  group_by(location_top20) %>%
  summarise(med_log = median(log_price, na.rm = TRUE), .groups = "drop") %>%
  arrange(med_log) %>%
  pull(location_top20)

# Visualization
df_loc %>%
  mutate(location_top20 = factor(location_top20, levels = loc_order)) %>%
  ggplot(aes(x = location_top20, y = log_price, fill = location_top20)) +
  geom_boxplot(outlier.alpha = 0.2, alpha = 0.75) +
  coord_flip() +
  labs(
    title = "log1p(Price) by Location (Top 20 + Other)",
    x = NULL,
    y = "log1p(Price_Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ location_top20, data = df_loc) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Location (Top20+Other)",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Location (Top20+Other)
\(\chi^2\) DF \(P\)
35,054.12 20 0
# Summary Table
df_loc %>%
  group_by(location_top20) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  mutate(pct = n / sum(n)) %>%
  arrange(desc(n)) %>%
  mutate(pct = scales::percent(pct, accuracy = 0.1)) %>%
  knitr::kable(
    caption = "Location (Top20+Other): Share and Price Summary",
    col.names = c("Location Group", "Count", "Median Price (Lac)", "Median Log Price", "Share"),
    digits = c(0, 0, 1, 3, 0),
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Location (Top20+Other): Share and Price Summary
Location Group Count Median Price (Lac) Median Log Price Share
new-delhi 24,945 160.0 5.081 14.0%
bangalore 23,262 85.0 4.454 13.1%
kolkata 21,605 51.0 3.951 12.1%
Other 21,439 53.0 3.989 12.1%
gurgaon 18,846 157.0 5.063 10.6%
ahmedabad 12,614 70.0 4.263 7.1%
hyderabad 11,147 81.8 4.416 6.3%
chennai 10,163 69.0 4.248 5.7%
jaipur 7,867 51.0 3.951 4.4%
greater-noida 4,490 64.0 4.174 2.5%
faridabad 3,733 75.0 4.331 2.1%
vadodara 2,361 39.0 3.689 1.3%
surat 2,180 56.9 4.058 1.2%
pune 2,177 70.0 4.263 1.2%
thane 1,869 70.0 4.263 1.1%
mumbai 1,814 210.0 5.352 1.0%
visakhapatnam 1,729 55.0 4.025 1.0%
mohali 1,479 65.0 4.190 0.8%
zirakpur 1,478 63.0 4.159 0.8%
chandigarh 1,402 72.0 4.290 0.8%
noida 1,247 85.0 4.454 0.7%

location_top20 (Top20+Other) strongly differentiates prices as confirmed by a significant Kruskal–Wallis test (χ² ≈ 35054.12, P ≪ 0.001). Within the Top 20, medians range widely from 210 Lac in Mumbai to 39 Lac in Vadodara, reflecting a substantial city-tier premium. The “Other” category, with a 12.1% share and 53 Lac median, aggregates smaller cities to represent a group average rather than specific local market dynamics.

# b. Transaction effect
tx_order <- house_df %>%
  filter(!is.na(Transaction)) %>%
  group_by(Transaction) %>%
  summarise(med_log = median(log_price, na.rm = TRUE), .groups = "drop") %>%
  arrange(med_log) %>%
  pull(Transaction)

# Visualization
house_df %>%
  filter(!is.na(Transaction)) %>%
  mutate(Transaction = factor(Transaction, levels = tx_order)) %>%
  ggplot(aes(x = Transaction, y = log_price, fill = Transaction)) +
  geom_boxplot(outlier.alpha = 0.2, alpha = 0.75) +
  labs(
    title = "log1p(Price) by Transaction",
    x = "Transaction",
    y = "log1p(Price_Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Transaction, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Transaction",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Transaction
\(\chi^2\) DF \(P\)
2,941.07 4 0
# Summary Table
house_df %>%
  filter(!is.na(Transaction)) %>%
  group_by(Transaction) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  arrange(desc(n)) %>%
  knitr::kable(
    caption = "Price Summary by Transaction",
    col.names = c("Transaction", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Summary by Transaction
Transaction Count Median Price (Lac) Median Log Price
Resale 135,562 75.0 4.331
New Property 41,423 93.0 4.543
Other 703 45.0 3.829
Unknown 67 56.0 4.043
Rent/Lease 2 52.5 3.826

Transaction type significantly influences house prices as confirmed by a Kruskal–Wallis test showing statistically distinct distributions (\(\chi^2 \approx 2941.07\), \(P \ll 0.001\)). Median comparisons reveal a clear premium for new properties where New Property listings (93 Lac) generally exceed Resale (75 Lac). However, results for categories such as Other and Rent-Lease should be viewed as exploratory due to negligible sample sizes.

# c. Furnishing effect
fur_order <- house_df %>%
  filter(!is.na(Furnishing)) %>%
  group_by(Furnishing) %>%
  summarise(med_log = median(log_price, na.rm = TRUE), .groups = "drop") %>%
  arrange(med_log) %>%
  pull(Furnishing)

# Visualization
house_df %>%
  filter(!is.na(Furnishing)) %>%
  mutate(Furnishing = factor(Furnishing, levels = fur_order)) %>%
  ggplot(aes(x = Furnishing, y = log_price, fill = Furnishing)) +
  geom_boxplot(outlier.alpha = 0.2, alpha = 0.75) +
  labs(
    title = "log1p(Price) by Furnishing",
    x = "Furnishing",
    y = "log1p(Price_Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Furnishing, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Furnishing",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Furnishing
\(\chi^2\) DF \(P\)
2,984.756 3 0
# Summary Table
house_df %>%
  filter(!is.na(Furnishing)) %>%
  group_by(Furnishing) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  arrange(desc(n)) %>%
  knitr::kable(
    caption = "Price Summary by Furnishing",
    col.names = c("Furnishing", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ","))
Price Summary by Furnishing
Furnishing Count Median Price (Lac) Median Log Price
Semi-Furnished 82,832 87.6 4.484
Unfurnished 73,458 70.0 4.263
Furnished 19,404 75.0 4.331
Unknown 2,063 52.0 3.970

The Kruskal-Wallis test confirms a statistically significant difference in house price distributions across furnishing statuses (\(\chi^2 \approx 2984.756\), \(P \ll 0.001\)). Regarding median prices, semi-furnished properties rank highest at approximately 87.5 Lac, followed by furnished units at 75 Lac and unfurnished ones at 70 Lac, indicating a clear premium associated with furnishing configurations. In contrast, the unknown category presents a notably lower median of 52 Lac. Given its smaller sample size (n=2067), this likely reflects heterogeneity due to missing information.

# d. Facing effect
facing_order <- house_df %>%
  filter(!is.na(facing)) %>%
  group_by(facing) %>%
  summarise(med_log = median(log_price, na.rm = TRUE), .groups = "drop") %>%
  arrange(med_log) %>%
  pull(facing)

# Visualization
house_df %>%
  filter(!is.na(facing)) %>%
  mutate(facing = factor(facing, levels = facing_order)) %>%
  ggplot(aes(x = facing, y = log_price, fill = facing)) +
  geom_boxplot(outlier.alpha = 0.2, alpha = 0.75) +
  labs(
    title = "log1p(Price) by Facing",
    x = "Facing",
    y = "log1p(Price_Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ facing, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Facing",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Facing
\(\chi^2\) DF \(P\)
14,554.94 8 0
# Summary Table
house_df %>%
  filter(!is.na(facing)) %>%
  group_by(facing) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop") %>%
  arrange(desc(n)) %>%
  knitr::kable(
    caption = "Price Summary by Facing",
    col.names = c("Facing", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Price Summary by Facing
Facing Count Median Price (Lac) Median Log Price
Unknown 65,659 62.0 4.143
East 52,238 85.0 4.454
North - East 23,348 121.0 4.804
North 15,301 83.0 4.431
West 8,471 75.0 4.331
South 4,315 136.0 4.920
North - West 3,803 160.0 5.081
South - East 2,570 65.5 4.197
South -West 2,052 78.0 4.369
# Kruskal-Wallis Test (excluding 'Unknown')
kruskal.test(Price_Lac ~ facing, 
             data = house_df %>% filter(!is.na(facing), facing != "Unknown")) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Facing (Excluding Unknown)",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Facing (Excluding Unknown)
\(\chi^2\) DF \(P\)
3,677.443 7 0

Due to the high Unknown rate of 36.96% in facing, this category is treated as missing information and the results are reported as supplementary. The Kruskal-Wallis test reveals significant differences in price distributions across orientations (\(\chi^2 \approx 14554.94\), \(P \ll 0.001\)), and these differences remain significant even after excluding the Unknown category (\(\chi^2 \approx 3677.443\), \(P \ll 0.001\)). Median prices are higher for North-West and South facing properties at approximately 160 Lac and 136 Lac respectively, while the Unknown group sits considerably lower at 62 Lac.

# e.Ownership effect
own_order <- house_df %>%
  filter(!is.na(Ownership)) %>%
  group_by(Ownership) %>%
  summarise(med_log = median(log_price, na.rm = TRUE), .groups = "drop") %>%
  arrange(med_log) %>%
  pull(Ownership)

# Visualization
house_df %>%
  filter(!is.na(Ownership)) %>%
  mutate(Ownership = factor(Ownership, levels = own_order)) %>%
  ggplot(aes(x = Ownership, y = log_price, fill = Ownership)) +
  geom_boxplot(outlier.alpha = 0.2, alpha = 0.75) +
  coord_flip() +
  labs(
    title = "log1p(Price) by Ownership",
    x = "Ownership",
    y = "log1p(Price_Lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    legend.position = "none")

# Kruskal-Wallis Test
kruskal.test(Price_Lac ~ Ownership, data = house_df) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Ownership",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Ownership
\(\chi^2\) DF \(P\)
12,582.87 4 0
# Summary Table
house_df %>%
  filter(!is.na(Ownership)) %>%
  group_by(Ownership) %>%
  summarise(
    n = n(),
    median_price = median(Price_Lac, na.rm = TRUE),
    median_log_price = median(log_price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n)) %>%
  knitr::kable(
    caption = "Price Summary by Ownership",
    col.names = c("Ownership", "Count", "Median Price (Lac)", "Median Log Price"),
    digits = c(0, 0, 1, 3),
    format.args = list(big.mark = ",")
  ) %>%
  kableExtra::kable_styling(full_width = TRUE)
Price Summary by Ownership
Ownership Count Median Price (Lac) Median Log Price
Freehold 106,397 88.0 4.489
Unknown 61,855 64.0 4.174
Leasehold 5,149 91.8 4.530
Co-operative Society 3,344 55.0 4.025
Power Of Attorney 1,012 163.0 5.100
# Kruskal-Wallis Test (Excluding 'Unknown')
kruskal.test(Price_Lac ~ Ownership, 
             data = house_df %>% filter(!is.na(Ownership), Ownership != "Unknown")) %>%
  broom::tidy() %>%
  dplyr::select(statistic, parameter, p.value) %>%
  knitr::kable(
    caption = "Kruskal-Wallis Test: Price by Ownership",
    col.names = c("$\\chi^2$", "DF", "$P$"),
    escape = FALSE,
    digits = 3,
    align = "c",
    format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = TRUE)
Kruskal-Wallis Test: Price by Ownership
\(\chi^2\) DF \(P\)
961.755 3 0

Due to the high Unknown rate of 34.82% in Ownership, this category is treated as missing information and the results are reported as supplementary. The Kruskal-Wallis test reveals significant differences in price distributions across ownership types (\(\chi^2 \approx 12582.87\), \(P \ll 0.001\)), and these differences remain significant even after excluding the Unknown category (\(\chi^2 \approx 961.755\), \(P \ll 0.001\)). Median prices are higher for Freehold and Leasehold properties, at 88 Lac and 91.8 Lac respectively, while Co-operative Society units are lower at approximately 55 Lac.

4.4 Multivariate Relationships

4.4.1 Correlation Matrix

This section employs a correlation matrix heatmap to summarize linear relationships among key numerical variables such as Price_Lac, Price_per_sqft, and various area or floor measures. The goal is to detect highly correlated feature groups that indicate redundancy or multicollinearity risks and to inform subsequent feature engineering decisions.

library(ggcorrplot)

key_numeric_vars <- c(
  "Price_Lac", "Price_per_sqft",
  "Area_sqft", "Carpet_Area_sqft", "Super_Area_sqft",
  "Current_Floor", "Total_Floors",
  "Bathroom_num", "Balcony_num", "Parking_num")

num_df_key <- house_df %>%
  dplyr::select(where(is.numeric)) %>%
  dplyr::select(any_of(key_numeric_vars), everything())

# a. Clustered correlation heatmap
corr_mat <- cor(num_df_key, use = "pairwise.complete.obs", method = "pearson")
corr_mat[is.na(corr_mat)] <- 0

ggcorrplot::ggcorrplot(
  corr_mat,
  type = "lower",
  hc.order = TRUE,
  lab = TRUE,
  lab_size = 3,
  outline.col = "white",
  colors = c("steelblue", "white", "tomato"),
  show.diag = FALSE) +
  labs(title = "Clustered Correlation Heatmap") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    axis.text.x = element_text(
      angle = 45,
      hjust = 1, vjust = 1,
      margin = margin(t = 2)))

# b. Feature correlations table
corr_long <- as.data.frame(as.table(corr_mat)) %>%
  dplyr::rename(var1 = Var1, var2 = Var2, corr = Freq) %>%
  dplyr::filter(var1 != var2) %>%
  dplyr::mutate(abs_corr = abs(corr)) %>%
  dplyr::arrange(desc(abs_corr))

top_pairs <- corr_long %>%
  dplyr::rowwise() %>%
  dplyr::mutate(pair_id = paste(sort(c(as.character(var1), as.character(var2))), collapse = " | ")) %>%
  dplyr::ungroup() %>%
  dplyr::distinct(pair_id, .keep_all = TRUE) %>%
  dplyr::select(var1, var2, corr, abs_corr) %>%
  dplyr::slice_head(n = 10)

top_pairs %>%
  dplyr::mutate(Interpretation = dplyr::case_when(
    abs_corr >= 0.9 ~ "Redundant (Consider Dropping)",
    abs_corr >= 0.7 ~ "Strong Linkage",
    abs_corr >= 0.4 ~ "Moderate Correlation",
    TRUE ~ "Weak Correlation")) %>%
  knitr::kable(
    caption = "Top 10 Strongest Feature Correlations",
     col.names = c("Variable A", "Variable B", "$r$", "$\\vert r \\vert$", "Interpretation"),
    escape = FALSE,
    digits = 3,
    align = "llrrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Top 10 Strongest Feature Correlations
Variable A Variable B \(r\) \(\vert r \vert\) Interpretation
Carpet_Area_sqft Area_sqft 1.000 1.000 Redundant (Consider Dropping)
Super_Area_sqft Area_sqft 1.000 1.000 Redundant (Consider Dropping)
Price_per_sqft Price_Lac 0.763 0.763 Strong Linkage
Total_Floors Current_Floor 0.734 0.734 Strong Linkage
log_price Bathroom_num 0.718 0.718 Strong Linkage
Parking_capped Parking_num 0.509 0.509 Moderate Correlation
Floor_Ratio Current_Floor 0.452 0.452 Moderate Correlation
log_price Balcony_num 0.362 0.362 Weak Correlation
log_price Total_Floors 0.356 0.356 Weak Correlation
Balcony_num Bathroom_num 0.341 0.341 Weak Correlation

The heatmap and the Top-10 table jointly demonstrate that the three area measures are effectively duplicates, as Area_sqft correlates perfectly with both Super_Area_sqft and Carpet_Area_sqft (r=1.000). Price signals are also strongly related as Price_per_sqft correlates robustly with Price_Lac (r=0.763), suggesting that unit price is a key driver of total price variation. Floor and amenity features provide additional signals with moderate-to-strong relationships linking Total_Floors with Current_Floor (r=0.734), Parking_capped with Parking_num (r=0.509), and Bathroom_num with Balcony_num (r=0.341), suggesting they complement the core price and area information rather than dominate it. Additionally, “zero correlations” resulting from NA-to-zero replacement should be interpreted as data sparsity rather than statistical independence.

4.4.2 Interaction Effects

This section investigates interaction effects between property size and location by integrating raw-scale faceted scatter plots with a log-linear regression forest plot. While the facets visualize the actual price-area distributions across the Top-20 locations, the forest plot provides a standardized comparison of pricing sensitivity by extracting regression slopes.

library(broom)

# a. Raw scale (facet)
area_cap_99 <- quantile(house_df_base$Area_sqft, probs = 0.99, na.rm = TRUE)

house_df_cap <- house_df_base %>%
  filter(Area_sqft <= area_cap_99)

top_n <- 20
loc_top20_levels <- house_df_cap %>%
  count(location, sort = TRUE) %>%
  slice_head(n = top_n) %>%
  pull(location)

plot_df_top20 <- house_df_cap %>%
  filter(location %in% loc_top20_levels) %>%
  mutate(location = forcats::fct_reorder(location, Price_Lac, .fun = median, .desc = TRUE))

ggplot(plot_df_top20, aes(x = Area_sqft, y = Price_Lac)) +
  geom_point(alpha = 0.15, size = 0.8, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8, color = "firebrick") +
  scale_y_continuous(labels = label_number(big.mark = ",")) +
  facet_wrap(~ location, scales = "free_y", ncol = 4) +
  labs(
    title = "Price vs Area across Top-20 Locations (Raw Scale)",
    x = "Area (sqft)",
    y = "Price (lac)") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    strip.text = element_text(size = 8, face = "bold"),
    panel.spacing = unit(1, "lines"))

# b. Log scale (forest)
slopes_log <- plot_df_top20 %>%
  group_by(location) %>%
  do(tidy(lm(log_price ~ Area_sqft, data = .), conf.int = TRUE)) %>%
  filter(term == "Area_sqft") %>%
  ungroup() %>%
  mutate(location = reorder(location, estimate))

ggplot(slopes_log, aes(x = estimate, y = location)) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.3, color = "gray60") +
  geom_point(size = 2.5, color = "darkred") +
  geom_vline(xintercept = mean(slopes_log$estimate), linetype = "dashed", color = "blue", alpha = 0.5) +
  labs(
    title = "Pricing Sensitivity by Location (Log-Linear Slopes)",
    x = "Slope Coefficient (Area's Marginal Effect on Log Price)",
    y = NULL) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
    axis.text.y = element_text(size = 9))

# c. Top-5 vs Bottom-5 summary table
slopes_tbl <- plot_df_top20 %>%
  group_by(location) %>%
  do(tidy(lm(log_price ~ Area_sqft, data = .), conf.int = TRUE)) %>%
  ungroup() %>%
  filter(term == "Area_sqft") %>%
  left_join(
    plot_df_top20 %>% count(location, name = "n"),
    by = "location"
  ) %>%
  mutate(
    pct_uplift_100 = (exp(estimate * 100) - 1) * 100
  ) %>%
  select(location, n, estimate, conf.low, conf.high, pct_uplift_100)

top5 <- slopes_tbl %>% arrange(desc(estimate)) %>% slice_head(n = 5) %>%
  mutate(group = "Top-5")
bot5 <- slopes_tbl %>% arrange(estimate) %>% slice_head(n = 5) %>%
  mutate(group = "Bottom-5")

top_bot_10 <- bind_rows(top5, bot5)

avg_stats <- slopes_tbl %>%
  summarise(
    n_locations = n(),
    slope_mean  = mean(estimate, na.rm = TRUE),
    slope_min   = min(estimate, na.rm = TRUE),
    slope_max   = max(estimate, na.rm = TRUE)
  )

avg_row <- tibble(
  group = "Average",
  location = "NA",
  n = avg_stats$n_locations,          # number of locations contributing to the mean
  estimate = avg_stats$slope_mean,
  conf.low = NA_real_,
  conf.high = NA_real_,
  pct_uplift_100 = (exp(avg_stats$slope_mean * 100) - 1) * 100
)

top_bot_11 <- bind_rows(avg_row, top_bot_10) %>%
  mutate(
    estimate = round(estimate, 6),
    conf.low = ifelse(is.na(conf.low), NA, round(conf.low, 6)),
    conf.high = ifelse(is.na(conf.high), NA, round(conf.high, 6)),
    `Uplift per +100 sqft` = scales::percent(pct_uplift_100 / 100, accuracy = 0.01)
  ) %>%
  select(
    Group = group,
    Location = location,
    N = n,
    Slope = estimate,
    `CI (low)` = conf.low,
    `CI (high)` = conf.high,
    `Uplift per +100 sqft`
  )

top_bot_11 %>%
  knitr::kable(
    caption = "Top-5 vs Bottom-5 locations by pricing sensitivity (log_price ~ Area_sqft)",
    align = "llrrrrr"
  ) %>%
  kableExtra::kable_styling(full_width = TRUE)
Top-5 vs Bottom-5 locations by pricing sensitivity (log_price ~ Area_sqft)
Group Location N Slope CI (low) CI (high) Uplift per +100 sqft
Average NA 20 0.000870 NA NA 9.09%
Top-5 thane 1865 0.001759 0.001676 0.001841 19.23%
Top-5 kolkata 21581 0.001348 0.001337 0.001360 14.43%
Top-5 mumbai 1792 0.001214 0.001170 0.001259 12.91%
Top-5 new-delhi 24880 0.001110 0.001099 0.001121 11.74%
Top-5 pune 2143 0.001097 0.001063 0.001130 11.59%
Bottom-5 ahmedabad 12194 0.000297 0.000284 0.000310 3.02%
Bottom-5 bangalore 23185 0.000447 0.000437 0.000456 4.57%
Bottom-5 chandigarh 1399 0.000462 0.000416 0.000509 4.73%
Bottom-5 faridabad 3693 0.000486 0.000465 0.000506 4.98%
Bottom-5 zirakpur 1475 0.000496 0.000455 0.000536 5.08%

After capping Area_sqft at the 99th percentile (3,783.88 sqft), the analysis of 154,721 observations from the Top-20 locations reveals a pronounced interaction between property size and location. While all location-specific slopes from the log_price ~ Area_sqft regression are positive, they exhibit substantial variation, ranging from 0.000297 to 0.001759 with a mean of 0.000870. This indicates that the marginal impact of area on price differs materially across geographic segments. On average, a 100 sqft increase in area yields a 9.09% price uplift. Sensitivities peak in Thane, Kolkata, and Mumbai, reaching approximately 19.23%, 14.43%, and 12.91% respectively, whereas Ahmedabad and Bangalore exhibit much flatter responses at 3.02% and 4.57%. The 95% confidence intervals in the forest plot confirm that 8 locations reside significantly above the global mean while 10 fall significantly below, reinforcing that this market heterogeneity is systematic rather than mere noise.

4.5 Data Quality and Feature Engineering Strategy

4.5.1 Outlier and Skewness Treatment for Modeling

This section implements modeling-oriented preprocessing to mitigate the impact of long-tailed distributions and extreme observations. A logarithmic transformation is applied to the target variable Price_Lac to reduce skewness. Extreme values are identified using 99th-percentile thresholds for both price and area, which are used for outlier flagging and selective filtering rather than immediate value modification. Price outliers are optionally handled through quantile-based capping during model training, while anomalous coded values in auxiliary features such as Parking_num = 999 are recoded as missing.

library(tidyr)
library(knitr)
library(e1071)
### 3.5.1 Outlier Flagging & Skewness Treatment
q_hi <- 0.99

model_df <- house_df %>%
  mutate(
    Price_Lac   = as.numeric(Price_Lac),
    Area_sqft   = as.numeric(Area_sqft),
    Parking_num = as.numeric(Parking_num)) %>%
  filter(
    !is.na(Price_Lac), Price_Lac >= 0,
    !is.na(Area_sqft), Area_sqft > 0) %>%
  mutate(
    log_price = log1p(Price_Lac),
    Parking_num_fix = if_else(Parking_num == 999, NA_real_, Parking_num))

# a. Compute 99th percentile thresholds
area_q99  <- quantile(model_df$Area_sqft,  probs = q_hi, na.rm = TRUE)
price_q99 <- quantile(model_df$Price_Lac, probs = q_hi, na.rm = TRUE)

threshold_tbl <- tibble::tibble(
  Variable     = c("Area_sqft", "Price_Lac"),
  Quantile     = q_hi,
  Threshold_99 = c(as.numeric(area_q99), as.numeric(price_q99)))

threshold_tbl %>%
  knitr::kable(
    caption = "99th Percentile Thresholds (Flagging Only)",
    digits  = 3,
    align   = "lrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
99th Percentile Thresholds (Flagging Only)
Variable Quantile Threshold_99
Area_sqft 0.99 3783.88
Price_Lac 0.99 700.00
# b. Create outlier flags
house_df_outlier_flag <- model_df %>%
  mutate(
    area_outlier_99  = Area_sqft  > area_q99,
    price_outlier_99 = Price_Lac  > price_q99,
    any_outlier_99   = area_outlier_99 | price_outlier_99)

audit_outlier_flag <- tibble::tibble(
  Metric = c(
    "Rows in modeling dataset",
    "Area 99th percentile threshold",
    "Price 99th percentile threshold",
    "Share flagged by Area (top 1%)",
    "Share flagged by Price (top 1%)",
    "Share flagged by either (union)",
    "Count of Parking_num coded as 999"),
  Value = c(
    nrow(house_df_outlier_flag),
    as.numeric(area_q99),
    as.numeric(price_q99),
    mean(house_df_outlier_flag$area_outlier_99, na.rm = TRUE),
    mean(house_df_outlier_flag$price_outlier_99, na.rm = TRUE),
    mean(house_df_outlier_flag$any_outlier_99, na.rm = TRUE),
    sum(model_df$Parking_num == 999, na.rm = TRUE)))

audit_outlier_flag %>%
  knitr::kable(
    caption = "Outlier Flag Audit (99th Percentile)",
    digits  = 4,
    align   = "lr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Outlier Flag Audit (99th Percentile)
Metric Value
Rows in modeling dataset 177757.0000
Area 99th percentile threshold 3783.8800
Price 99th percentile threshold 700.0000
Share flagged by Area (top 1%) 0.0100
Share flagged by Price (top 1%) 0.0066
Share flagged by either (union) 0.0120
Count of Parking_num coded as 999 1.0000
# c. Skewness comparison (Raw vs Log)
skew_table <- tibble::tibble(
  Variable = c("Price_Lac", "log_price"),
  Skewness = c(
    e1071::skewness(model_df$Price_Lac, na.rm = TRUE, type = 3),
    e1071::skewness(model_df$log_price, na.rm = TRUE, type = 3)))

skew_table %>%
  knitr::kable(
    caption = "Skewness Audit: Raw vs Log Transform",
    digits  = 3,
    align   = "lr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Skewness Audit: Raw vs Log Transform
Variable Skewness
Price_Lac 270.193
log_price 0.324
# d. Preview of flagged rows
area_outlier_preview <- house_df_outlier_flag %>%
  filter(area_outlier_99) %>%
  arrange(desc(Area_sqft)) %>%
  select(location, Area_sqft, Price_Lac, log_price, Price_per_sqft, Current_Floor, Total_Floors) %>%
  slice_head(n = 10)

area_outlier_preview %>%
  knitr::kable(
    caption = "Preview: Area Outliers Flagged (Top by Area_sqft, Values Unchanged)",
    digits  = 3,
    align   = "lrrrrrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Preview: Area Outliers Flagged (Top by Area_sqft, Values Unchanged)
location Area_sqft Price_Lac log_price Price_per_sqft Current_Floor Total_Floors
guwahati 709222 60.0 4.111 5859 3 8
visakhapatnam 530040 105.0 4.663 0 NA NA
bhiwadi 495970 19.0 2.996 2533 10 14
agra 282004 20.0 3.045 7 3 6
gurgaon 194936 260.0 5.565 6131 20 20
agra 113134 56.0 4.043 3500 2 11
jaipur 107806 68.0 4.234 4231 11 14
thrissur 81845 45.0 3.829 NA 3 7
siliguri 81675 45.7 3.844 4200 1 5
siliguri 71775 40.2 3.718 4200 5 5
price_outlier_preview <- house_df_outlier_flag %>%
  filter(price_outlier_99) %>%
  arrange(desc(Price_Lac)) %>%
  select(location, Area_sqft, Price_Lac, log_price, Price_per_sqft, Current_Floor, Total_Floors) %>%
  slice_head(n = 10)

price_outlier_preview %>%
  knitr::kable(
    caption = "Preview: Price Outliers Flagged (Top by Price_Lac, Values Unchanged)",
    digits  = 3,
    align   = "lrrrrrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Preview: Price Outliers Flagged (Top by Price_Lac, Values Unchanged)
location Area_sqft Price_Lac log_price Price_per_sqft Current_Floor Total_Floors
vadodara 1252 140030 11.850 6700000 5 9
siliguri 970 51004 10.840 4041600 1 4
raipur 920 39675 10.589 3450000 2 2
raipur 750 22980 10.042 2669100 2 9
varanasi 20000 8000 8.987 40000 NA NA
gurgaon 11000 6000 8.700 54545 25 25
gurgaon 5500 5500 8.613 52381 25 25
gurgaon 5400 5500 8.613 74324 10 30
gurgaon 5250 5200 8.557 54737 10 39
gurgaon 11000 5000 8.517 45455 25 25

In this section, potential extreme observations are flagged using the 99th-percentile thresholds for Area_sqft and Price_Lac, without applying capping, winsorization, or overwriting any original values in the dataset.

With q_hi = 0.99, the computed thresholds are 3,783.88 sqft for Area_sqft and 700.0 Lac for Price_Lac; these cutoffs are used solely to construct indicator variables (area_outlier_99,price_outlier_99, any_outlier_99).

The anomalous code Parking_num = 999 is recoded as missing Parking_num_fix = NA to prevent it from being interpreted as a valid numeric magnitude by the model.

Skewness is substantially reduced through logarithmic transformation, with price skewness decreasing from 270.193 for Price_Lac to 0.324 for log_price.

4.5.2 Missing and Unknown Treatment for Modeling

This section performs a unified treatment of missing and unknown values in numerical, binary, and categorical features. Missingness is audited based on NA rates, and explicitly coded “Unknown” values in categorical variables are analyzed together with NA values. Features with missing rates exceeding a predefined threshold are removed, while the remaining variables are processed according to feature type, using median imputation with missing indicators for numerical and binary features, or explicit level encoding for NA and “Unknown” values in categorical features.

### 4.5.2 Missing and Unknown Treatment for Modeling
library(stringr)

data_missing_in <- house_df_outlier_flag
missing_thresh <- 0.30

# a. Missingness summary (NA-based)
missing_tbl <- data_missing_in %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Missing_n") %>%
  mutate(
    Missing_rate = Missing_n / nrow(data_missing_in),
    Type = case_when(
      Feature %in% names(select(data_missing_in, where(is.numeric)))   ~ "Numeric",
      Feature %in% names(select(data_missing_in, where(is.logical)))   ~ "Binary",
      Feature %in% names(select(data_missing_in, where(is.character))) ~ "Categorical",
      TRUE ~ "Other")) %>%
  arrange(desc(Missing_rate))

missing_tbl %>%
  slice_head(n = 15) %>%
  knitr::kable(
    caption = "Missingness Summary (Top 15 by NA rate)",
    digits  = 3,
    align   = "lrrl") %>%
  kableExtra::kable_styling(full_width = TRUE)
Missingness Summary (Top 15 by NA rate)
Feature Missing_n Missing_rate Type
Super_Area_sqft 101522 0.571 Numeric
Carpet_Area_sqft 76235 0.429 Numeric
Price_per_sqft 7891 0.044 Numeric
Current_Floor 7298 0.041 Numeric
Total_Floors 7298 0.041 Numeric
CurrentFloor_grp 7298 0.041 Other
TotalFloors_grp 7298 0.041 Other
Floor_Ratio 7298 0.041 Numeric
FloorPos_grp 7298 0.041 Other
Parking_num_fix 1 0.000 Numeric
location 0 0.000 Categorical
Price_Lac 0 0.000 Numeric
Area_sqft 0 0.000 Numeric
Bathroom_num 0 0.000 Numeric
Balcony_num 0 0.000 Numeric
# b. "Unknown" summary for categorical columns (string 'Unknown')
unknown_tbl <- data_missing_in %>%
  select(where(is.character)) %>%
  summarise(across(everything(), ~ sum(str_to_lower(str_trim(.)) == "unknown", na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Unknown_n") %>%
  mutate(Unknown_rate = Unknown_n / nrow(data_missing_in)) %>%
  arrange(desc(Unknown_rate))

unknown_tbl %>%
  slice_head(n = 10) %>%
  knitr::kable(
    caption = "‘Unknown’ Level Frequency (Top 10 categorical features)",
    digits  = 3,
    align   = "lrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
‘Unknown’ Level Frequency (Top 10 categorical features)
Feature Unknown_n Unknown_rate
facing 65659 0.369
Ownership 61855 0.348
Furnishing 2063 0.012
Transaction 67 0.000
location 0 0.000
# c. Unified modeling decision table (threshold = 30%)
decision_tbl <- missing_tbl %>%
  select(Feature, Type, Missing_n, Missing_rate) %>%
  left_join(unknown_tbl %>% select(Feature, Unknown_n, Unknown_rate), by = "Feature") %>%
  mutate(
    Unknown_n    = tidyr::replace_na(Unknown_n, 0L),
    Unknown_rate = tidyr::replace_na(Unknown_rate, 0),
    Action = case_when(
      Missing_rate >= missing_thresh ~ "Drop (high missingness >= 30%)",
      Type %in% c("Numeric", "Binary") & Missing_rate > 0 ~ "Keep; add missing flag + median impute",
      Type == "Categorical" & (Missing_rate > 0 | Unknown_rate > 0) ~ "Keep; treat NA/'Unknown' as explicit level",
      TRUE ~ "Keep"),
    Issue = case_when(
      Type == "Categorical" ~ paste0("NA rate = ", round(Missing_rate, 3), "; 'Unknown' rate = ", round(Unknown_rate, 3)),
      TRUE ~ paste0("NA rate = ", round(Missing_rate, 3)))) %>%
  arrange(desc(str_detect(Action, "^Drop")), desc(Missing_rate), desc(Unknown_rate), Feature) %>%
  select(Feature, Type, Issue, Action)

decision_tbl %>%
  knitr::kable(
    caption = "Missing/Unknown Handling Rules for Modeling (Threshold = 30%)",
    align   = "llll") %>%
  kableExtra::kable_styling(full_width = TRUE)
Missing/Unknown Handling Rules for Modeling (Threshold = 30%)
Feature Type Issue Action
Super_Area_sqft Numeric NA rate = 0.571 Drop (high missingness >= 30%)
Carpet_Area_sqft Numeric NA rate = 0.429 Drop (high missingness >= 30%)
Price_per_sqft Numeric NA rate = 0.044 Keep; add missing flag + median impute
CurrentFloor_grp Other NA rate = 0.041 Keep
Current_Floor Numeric NA rate = 0.041 Keep; add missing flag + median impute
FloorPos_grp Other NA rate = 0.041 Keep
Floor_Ratio Numeric NA rate = 0.041 Keep; add missing flag + median impute
TotalFloors_grp Other NA rate = 0.041 Keep
Total_Floors Numeric NA rate = 0.041 Keep; add missing flag + median impute
Parking_num_fix Numeric NA rate = 0 Keep; add missing flag + median impute
facing Categorical NA rate = 0; ‘Unknown’ rate = 0.369 Keep; treat NA/‘Unknown’ as explicit level
Ownership Categorical NA rate = 0; ‘Unknown’ rate = 0.348 Keep; treat NA/‘Unknown’ as explicit level
Furnishing Categorical NA rate = 0; ‘Unknown’ rate = 0.012 Keep; treat NA/‘Unknown’ as explicit level
Transaction Categorical NA rate = 0; ‘Unknown’ rate = 0 Keep; treat NA/‘Unknown’ as explicit level
Area_sqft Numeric NA rate = 0 Keep
Balcony_grp Other NA rate = 0 Keep
Balcony_num Numeric NA rate = 0 Keep
Bathroom_grp Other NA rate = 0 Keep
Bathroom_num Numeric NA rate = 0 Keep
Parking_capped Numeric NA rate = 0 Keep
Parking_grp Other NA rate = 0 Keep
Parking_num Numeric NA rate = 0 Keep
Price_Lac Numeric NA rate = 0 Keep
any_outlier_99 Binary NA rate = 0 Keep
area_outlier_99 Binary NA rate = 0 Keep
location Categorical NA rate = 0; ‘Unknown’ rate = 0 Keep
log_price Numeric NA rate = 0 Keep
price_outlier_99 Binary NA rate = 0 Keep
# d. Numeric missing handling preview (Before vs After)
pps_median <- median(data_missing_in$Price_per_sqft, na.rm = TRUE)

missing_preview <- data_missing_in %>%
  mutate(
    Price_per_sqft_missing = as.integer(is.na(Price_per_sqft)),
    Price_per_sqft_imputed = if_else(is.na(Price_per_sqft), pps_median, as.numeric(Price_per_sqft))) %>%
  filter(Price_per_sqft_missing == 1) %>%
  slice_head(n = 10) %>%
  select(
    location,
    Area_sqft,
    Price_Lac,
    Price_per_sqft,
    Price_per_sqft_imputed,
    Price_per_sqft_missing)

missing_preview %>%
  knitr::kable(
    caption = "Preview: Numeric Missing Handling (Price_per_sqft NA -> Median Impute)",
    digits  = 3,
    align   = "lrrrrr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Preview: Numeric Missing Handling (Price_per_sqft NA -> Median Impute)
location Area_sqft Price_Lac Price_per_sqft Price_per_sqft_imputed Price_per_sqft_missing
thane 530 25.0 NA 6034 1
thane 923 70.0 NA 6034 1
thane 741 155.0 NA 6034 1
thane 1200 265.0 NA 6034 1
thane 400 22.0 NA 6034 1
thane 550 25.0 NA 6034 1
thane 1600 400.0 NA 6034 1
thane 625 48.0 NA 6034 1
thane 739 68.6 NA 6034 1
thane 894 160.0 NA 6034 1

Using a 30% missingness threshold, the decision table flags 2 variables for removal due to high NA rates, while all remaining variables are retained with explicit per-type handling rules.

Among categorical features, the highest “Unknown” frequency appears in facing (rate = 0.3696), which will be carried forward as an explicit level rather than being left as ambiguous missingness.

For the numeric preview example Price_per_sqft, there are 7,891 missing observations; the median used for the illustrative imputation is 6,034.0, and the table displays the first 10 imputed rows.

4.5.3 Feature Creation

This section proposes a small set of interpretable derived features such as floor position and density-style features and conducts a lightweight audit on their availability and basic relationship with the transformed target log_price. The goal is to identify feature candidates that are well-defined, have manageable missingness, and capture housing-price mechanisms.

data_in <- house_df_outlier_flag

# a. Feature creation
feature_df <- data_in %>%
  mutate(
    Floor_Ratio = dplyr::if_else(!is.na(Total_Floors) & Total_Floors > 0,
                                 Current_Floor / Total_Floors,
                                 NA_real_),
    is_top_floor    = dplyr::if_else(!is.na(Current_Floor) & !is.na(Total_Floors),
                                     Current_Floor == Total_Floors, FALSE),
    is_ground_floor = dplyr::if_else(!is.na(Current_Floor),
                                     Current_Floor == 0, FALSE),
    log_area = log1p(Area_sqft),
    Bathroom_per_1000sqft = Bathroom_num / (Area_sqft / 1000),
    Balcony_per_1000sqft  = Balcony_num  / (Area_sqft / 1000),
    Parking_per_1000sqft  = Parking_num_fix / (Area_sqft / 1000))

# b. Availability audit for derived features
derived_vars <- c(
  "Floor_Ratio", "is_top_floor", "is_ground_floor",
  "log_area", "Bathroom_per_1000sqft", "Balcony_per_1000sqft", "Parking_per_1000sqft")

derived_audit <- feature_df %>%
  summarise(across(all_of(derived_vars), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Missing_rate") %>%
  mutate(
    Type = case_when(
      Feature %in% names(select(feature_df, where(is.numeric))) ~ "Numeric",
      Feature %in% names(select(feature_df, where(is.logical))) ~ "Binary",
      TRUE ~ "Other")) %>%
  arrange(desc(Missing_rate))

derived_audit %>%
  knitr::kable(
    caption = "Derived Feature Availability Audit (Missing Rate)",
    digits  = 4,
    align   = "lrl") %>%
  kableExtra::kable_styling(full_width = TRUE)
Derived Feature Availability Audit (Missing Rate)
Feature Missing_rate Type
Floor_Ratio 0.0411 Numeric
Parking_per_1000sqft 0.0000 Numeric
is_top_floor 0.0000 Binary
is_ground_floor 0.0000 Binary
log_area 0.0000 Numeric
Bathroom_per_1000sqft 0.0000 Numeric
Balcony_per_1000sqft 0.0000 Numeric
# c. Summary stats for key numeric derived features
derived_summary <- feature_df %>%
  summarise(
    Floor_Ratio_p05 = quantile(Floor_Ratio, 0.05, na.rm = TRUE),
    Floor_Ratio_med = median(Floor_Ratio, na.rm = TRUE),
    Floor_Ratio_p95 = quantile(Floor_Ratio, 0.95, na.rm = TRUE),

    log_area_p05 = quantile(log_area, 0.05, na.rm = TRUE),
    log_area_med = median(log_area, na.rm = TRUE),
    log_area_p95 = quantile(log_area, 0.95, na.rm = TRUE),

    BathDen_p05 = quantile(Bathroom_per_1000sqft, 0.05, na.rm = TRUE),
    BathDen_med = median(Bathroom_per_1000sqft, na.rm = TRUE),
    BathDen_p95 = quantile(Bathroom_per_1000sqft, 0.95, na.rm = TRUE)) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

derived_summary %>%
  knitr::kable(
    caption = "Sanity Check: Selected Derived Feature Summary (p05 / median / p95)",
    digits  = 3,
    align   = "lr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Sanity Check: Selected Derived Feature Summary (p05 / median / p95)
Metric Value
Floor_Ratio_p05 0.000
Floor_Ratio_med 0.500
Floor_Ratio_p95 1.000
log_area_p05 5.303
log_area_med 7.048
log_area_p95 7.824
BathDen_p05 1.289
BathDen_med 2.000
BathDen_p95 14.286
# d. Association check with log_price (Spearman correlation)
corr_tbl <- feature_df %>%
  summarise(
    rho_log_area    = cor(log_price, log_area,  method = "spearman", use = "complete.obs"),
    rho_floor       = cor(log_price, Floor_Ratio, method = "spearman", use = "complete.obs"),
    rho_bath_den    = cor(log_price, Bathroom_per_1000sqft, method = "spearman", use = "complete.obs"),
    rho_bal_den     = cor(log_price, Balcony_per_1000sqft,  method = "spearman", use = "complete.obs"),
    rho_park_den    = cor(log_price, Parking_per_1000sqft,  method = "spearman", use = "complete.obs")) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Spearman_rho") %>%
  arrange(desc(abs(Spearman_rho)))

corr_tbl %>%
  knitr::kable(
    caption = "Spearman Correlation with log_price (Derived Features)",
    digits  = 3,
    align   = "lr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Spearman Correlation with log_price (Derived Features)
Feature Spearman_rho
rho_log_area 0.689
rho_park_den -0.561
rho_bal_den -0.351
rho_bath_den -0.348
rho_floor 0.031
# e. Group contrast examples (top floor / ground floor)
group_tbl <- feature_df %>%
  summarise(
    mean_log_price_all = mean(log_price, na.rm = TRUE),
    mean_log_price_top = mean(log_price[is_top_floor], na.rm = TRUE),
    mean_log_price_gnd = mean(log_price[is_ground_floor], na.rm = TRUE),
    n_top = sum(is_top_floor, na.rm = TRUE),
    n_gnd = sum(is_ground_floor, na.rm = TRUE)) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

group_tbl %>%
  knitr::kable(
    caption = "Mean log_price by Floor Flags",
    digits  = 3,
    align   = "lr") %>%
  kableExtra::kable_styling(full_width = TRUE)
Mean log_price by Floor Flags
Metric Value
mean_log_price_all 4.427
mean_log_price_top 4.217
mean_log_price_gnd 4.155
n_top 21612.000
n_gnd 10577.000
# f. Preview: Newly Created Features (first 10 rows)
feature_df %>%
  select(
    location, Price_Lac, log_price, Area_sqft,
    Floor_Ratio, is_top_floor, is_ground_floor,
    log_area, Bathroom_per_1000sqft, Balcony_per_1000sqft, Parking_per_1000sqft,
    area_outlier_99, price_outlier_99, any_outlier_99) %>%
  slice_head(n = 10) %>%
  knitr::kable(
    caption = "Preview: Newly Created Features (First 10 Rows)",
    digits  = 3,
    align   = "lrrrrllrrrrlll") %>%
  kableExtra::kable_styling(full_width = TRUE)
Preview: Newly Created Features (First 10 Rows)
location Price_Lac log_price Area_sqft Floor_Ratio is_top_floor is_ground_floor log_area Bathroom_per_1000sqft Balcony_per_1000sqft Parking_per_1000sqft area_outlier_99 price_outlier_99 any_outlier_99
thane 42.0 3.761 500 0.909 FALSE FALSE 6.217 2.000 4.000 2.000 FALSE FALSE FALSE
thane 98.0 4.595 473 0.136 FALSE FALSE 6.161 4.228 4.228 2.114 FALSE FALSE FALSE
thane 140.0 4.949 779 0.345 FALSE FALSE 6.659 2.567 2.567 1.284 FALSE FALSE FALSE
thane 25.0 3.258 530 0.333 FALSE FALSE 6.275 1.887 1.887 1.887 FALSE FALSE FALSE
thane 160.0 5.081 635 0.476 FALSE FALSE 6.455 3.150 3.150 1.575 FALSE FALSE FALSE
thane 45.0 3.829 680 0.286 FALSE FALSE 6.524 1.471 1.471 1.471 FALSE FALSE FALSE
thane 16.5 2.862 550 0.800 FALSE FALSE 6.312 1.818 3.636 1.818 FALSE FALSE FALSE
thane 60.0 4.111 575 0.000 FALSE TRUE 6.356 1.739 3.478 1.739 FALSE FALSE FALSE
thane 60.0 4.111 600 0.000 FALSE TRUE 6.399 1.667 3.333 1.667 FALSE FALSE FALSE
thane 160.0 5.081 900 0.111 FALSE FALSE 6.804 3.333 1.111 1.111 FALSE FALSE FALSE

Seven interpretable derived features, Floor_Ratio, is_top_floor, is_ground_floor, log_area, Bathroom_per_1000sqft, Balcony_per_1000sqft, and Parking_per_1000sqft, are created from house_df_outlier_flag, followed by a lightweight audit of feature availability and their basic relationship with the transformed target log_price.

In terms of availability, Floor_Ratio has a missing rate of approximately 0.0411, mainly driven by missing floor information, while the remaining derived features show zero missingness.

For relationship exploration, log_area shows a strong monotonic association with log_price (\(\rho \approx 0.689\)), whereas Floor_Ratio has a weak association (\(\rho \approx 0.031\)).

A simple group contrast further indicates that the mean log_price is about 4.427 overall, compared with about 4.217 for top-floor units (n=21,612) and about 4.155 for ground-floor units (n=10,577), suggesting a directional but relatively modest floor-position effect in this dataset.

4.6 Key Insights for Modelling

4.6.1 Main Determinants of House Price

This section summarizes the main EDA findings into the key drivers of house price.

Area is the strongest driver of property value, with Area_sqft exhibiting a clear positive correlation with total price. While the raw distribution is characterized by heavy tails and unstable variance, applying a log1p transformation effectively linearizes this relationship and stabilizes the variance (detailed in Sections 3.2.1 and 3.3.1).

Location acts as the strongest segmentation factor, with prices diverging significantly across cities even for properties of comparable size. As location strongly stratifies the market while exhibiting high cardinality and a head-heavy distribution, a Top-N grouping strategy is employed to consolidate rare categories and generate stable summaries (detailed in Sections 3.2.3 and 3.3.3).

Beyond basic dimensions, amenities add extra predictive signal to the data. Features such as Bathroom_num, Balcony_num, and parking availability demonstrate distinct price differentiation across groups. These variations are similarly observed in the log-transformed price distribution, indicating that the signals are robust to extreme outliers (detailed in Section 3.3.2).

Vertical characteristics also matter, as current_floor and total_floors reveal consistent price stratification.While relative floor position indicates some variance, the differences are less pronounced compared to absolute height metrics, showing a stronger statistical association with the specific floor level rather than its relative placement within the building (detailed in Sections 3.3.2 and 3.5.3).

Transaction type and furnishing status reflect specific market segments and exhibit significant price variations. Since sample sizes in certain subgroups are limited, the stability of estimates for features like Transaction and Furnishing is constrained by data sparsity in those categories (detailed in Section 3.3.3).

The interaction between area and location demonstrates that the marginal price gain per additional square foot varies significantly across cities. This observed heterogeneity indicates that Area_sqft alone does not fully capture the price dynamics across different spatial segments (detailed in Section 3.4.2).

4.6.2 Potential issues for modelling

This section summarizes the main modelling risks identified in the EDA and the practical handling rules applied in the analysis. These issues do not make the data unusable, but ignoring them may lead to unstable or biased model results.

Skewness and heavy tails characterize the target variable, where Price_Lac is highly right-skewed on the raw scale. Consequently, log1p(Price_Lac) is applied to stabilize the distribution for modeling purposes (detailed in Sections 3.2.1 and 3.5.1).

Outliers and possible miscoding are present in the dataset, manifesting as extreme area values and abnormal codes such as Parking_num=999. Extremes are flagged and obvious miscodings are recoded to missing values to prevent the model from learning erroneous signals (detailed in Sections 3.1.1 and 3.5.1).

Missingness and “Unknown” concentration affect specific features, with some variables showing high missing rates and others, like facing and Ownership, containing large shares of “Unknown” labels. Consistent handling rules are applied to manage these gaps effectively (detailed in Sections 3.1.2 and 3.5.2).

High-cardinality and imbalance define the location variable, which exhibits many levels and a head-heavy distribution. A “Top-20” plus “Other” grouping strategy is employed to reduce sparsity prior to selecting an encoding method (detailed in Sections 3.2.3 and 3.3.3).

Multicollinearity and redundancy exist among physical dimensions, as area measures are near duplicates and Price_per_sqft is mechanically linked to the total price. Feature reduction is performed to mitigate unstable coefficients and maintain clear interpretation (detailed in Sections 3.1.2 and 3.4.1).

Heterogeneity and interactions are evident, as the marginal price premium for area varies significantly across cities, indicating structural market heterogeneity. The model must accommodate non-linear or interaction structures to avoid systematic underfitting (detailed in Sections 3.4.2 and 3.3.1).

5. Data Modelling and Evaluation

🎯 Core Objective: Build and evaluate regression models to predict housing prices and classification models to predict transaction types, comparing baseline and advanced approaches.

Function Purpose
createDataPartition() Split data into training and test sets
train() Train linear regression model (caret)
ranger() Train Random Forest regression model
predict() Generate predictions on test data
postResample() Calculate RMSE and R² metrics
rpart() Build Decision Tree classifier
randomForest() Build Random Forest classifier
confusionMatrix() Evaluate classification performance
variable.importance Extract feature importance scores
expm1() Reverse log transformation for interpretation

This section presents the full modelling pipeline, including data preprocessing, feature engineering, regression analysis for house price prediction, and classification analysis for transaction type.

All steps are designed to ensure methodological consistency, interpretability, and robustness.

5.1 Data Preprocessing

In this stage, the cleaned dataset is loaded and further refined to ensure data quality and model integrity.

Redundant area-related variables (Carpet_Area_sqft, Super_Area_sqft) and Price_per_sqft are removed to prevent multicollinearity and target leakage. Missing values in numerical variables are imputed using median values, which is robust to outliers.

Logical consistency is enforced between floor-related variables by ensuring that Total_Floors is always greater than or equal to Current_Floor. For regression purposes, the target variable Price_Lac is log-transformed using log1p to reduce skewness and stabilize variance.

library(tidyverse)
library(caret)
library(randomForest)
library(ranger)

# Load cleaned dataset
df_raw <- house_df

# Remove redundant variables
df_model <- df_raw %>%
select(-Carpet_Area_sqft, -Super_Area_sqft, -Price_per_sqft)

# Median imputation for numerical variables
num_cols <- c("Area_sqft", "Current_Floor", "Total_Floors")
df_model[num_cols] <- lapply(df_model[num_cols], function(x) {
x[is.na(x)] <- median(x, na.rm = TRUE)
x
})

# Logical consistency check
df_model$Total_Floors <- pmax(df_model$Total_Floors, df_model$Current_Floor)

# Log-transform target for regression
df_model$Log_Price <- log1p(df_model$Price_Lac)

# Remove remaining unknown transaction labels
df_model <- df_model %>%
filter(Transaction %in% c("New Property", "Resale"))

The log-transformation significantly reduces right skewness in housing prices, improving suitability for regression modelling.

library(ggplot2)
library(gridExtra)

p1 <- ggplot(df_raw, aes(x = Price_Lac)) +
geom_histogram(bins = 30, fill = "steelblue") +
theme_minimal() +
labs(title = "Original Price Distribution")

p2 <- ggplot(df_model, aes(x = Log_Price)) +
geom_histogram(bins = 30, fill = "darkorange") +
theme_minimal() +
labs(title = "Log-Transformed Price Distribution")

grid.arrange(p1, p2, ncol = 2)

The distribution of the original housing prices is highly right-skewed, with a large proportion of properties concentrated in the lower price range and a small number of extremely high-priced outliers.

After applying a log-transformation to the price variable, the distribution becomes substantially more symmetric and closer to a normal shape.This transformation effectively reduces the influence of extreme values and stabilizes variance, thereby improving the suitability of the target variable for regression modelling.

The comparison of the two distributions visually confirms that the log-transformed price provides a more appropriate response variable for predictive modelling.

5.2 Feature Engineering

Feature engineering is conducted as a unified step prior to modelling. The selected features are designed to be shared across both regression and classification tasks unless otherwise stated.

Feature selection is guided by domain knowledge and empirical reasoning: Area_sqft as the primary indicator of property size; Bathroom_num, Current_Floor, and Total_Floors as indicators of housing configuration; Location, simplified to reduce sparsity while preserving spatial information.

To improve model stability, only the top 20 most frequent locations are retained, with the remaining grouped as “Other”.

library(forcats)
df_model <- df_model %>%
mutate(location = fct_lump_n(location, n = 20, other_level = "Other")) %>%
mutate(across(where(is.character), as.factor))

The relationship between property size and price further validates the inclusion of Area_sqft.

ggplot(df_model, aes(x = Area_sqft, y = Log_Price)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "Area vs Log-Price")

The scatter plot between Area_sqft and Log_Price reveals a clear and strong positive relationship, indicating that larger properties tend to command higher prices. Although a small number of extreme-area observations exist, the overall linear trend remains stable across the majority of the data. This empirical evidence supports the inclusion of property size as a core feature in both regression and classification tasks. By simplifying the location variable and retaining only the most frequent categories, the feature space is further stabilized without sacrificing key spatial information.

5.3 Regression: Predicting House Prices

5.3.1 Objective

The objective of this regression task is to predict residential property prices using structural and locational attributes.

5.3.2 Models and Evaluation

Two models are evaluated:

Linear Regression as a baseline

Random Forest Regression to capture non-linear relationships

Performance is assessed using RMSE and \(R^2\).All reported RMSE and \(R^2\) values are computed on the held-out test set, ensuring an unbiased assessment of out-of-sample predictive performance.

set.seed(123)

train_index <- createDataPartition(df_model$Log_Price, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# Linear Regression
lm_model <- train(Log_Price ~ Area_sqft + Bathroom_num + Current_Floor +
Total_Floors + location,
data = train_data, method = "lm")

lm_pred <- predict(lm_model, test_data)
lm_eval <- postResample(lm_pred, test_data$Log_Price)

# Random Forest Regression
rf_model <- ranger(Log_Price ~ Area_sqft + Bathroom_num + Current_Floor +
Total_Floors + location,
data = train_data,
num.trees = 100,
importance = "impurity")

rf_pred <- predict(rf_model, test_data)$predictions
rf_eval <- postResample(rf_pred, test_data$Log_Price)


# Training set performance
lm_train_pred <- predict(lm_model, train_data)
rf_train_pred <- predict(rf_model, train_data)$predictions

lm_train_eval <- postResample(lm_train_pred, train_data$Log_Price)
rf_train_eval <- postResample(rf_train_pred, train_data$Log_Price)

train_test_comparison <- data.frame(
  Model = c("Linear Regression", "Random Forest"),
  Train_RMSE = c(lm_train_eval["RMSE"], rf_train_eval["RMSE"]),
  Test_RMSE  = c(lm_eval["RMSE"], rf_eval["RMSE"]),
  Train_R2   = c(lm_train_eval["Rsquared"], rf_train_eval["Rsquared"]),
  Test_R2    = c(lm_eval["Rsquared"], rf_eval["Rsquared"])
)

train_test_comparison
##               Model Train_RMSE Test_RMSE  Train_R2   Test_R2
## 1 Linear Regression  0.4823692 0.4776379 0.6461527 0.6517759
## 2     Random Forest  0.1702558 0.2282906 0.9566152 0.9206369

The performance comparison shows that Random Forest regression substantially outperforms Linear Regression, achieving a significantly lower RMSE and a higher R² value. While the linear model explains approximately 64% of the variance in log-transformed prices, the Random Forest model captures over 92% of the variance. This performance gap suggests that housing prices are influenced by non-linear relationships and feature interactions that cannot be adequately modeled by a simple linear framework.

Residual diagnostics indicate strong overall generalization on the held-out test set, while also revealing increased uncertainty for properties at the upper end of the size and price distributions.

Feature importance analysis highlights Area_sqft as the dominant driver of housing prices, followed by Bathroom_num and location. Floor-related variables contribute relatively less, suggesting that overall space and functional layout play a more critical role than vertical positioning within a building.

# Residual Diagnostics for Random Forest Regression
rf_residuals <- test_data$Log_Price - rf_pred

residual_df <- data.frame(
  Fitted = rf_pred,
  Residuals = rf_residuals
)

ggplot(residual_df, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(
    title = "Residual Plot: Random Forest Regression",
    x = "Predicted Log-Price",
    y = "Residuals"
  )

The residual plot shows no clear systematic pattern, with residuals approximately centered around zero across the range of predicted values. This indicates that the Random Forest model does not suffer from strong bias or heteroscedasticity. While slightly larger variance is observed at the higher end of predicted prices, this is likely driven by the limited number of high-value observations rather than systematic model misspecification. Overall, the residual diagnostics support the model’s good generalization performance on unseen data.

#  Top 5 Prediction Errors Analysis
error_data <- test_data %>%
  mutate(
    Predicted_Price = expm1(rf_pred), 
    Actual_Price = expm1(Log_Price),
    Absolute_Error = abs(Actual_Price - Predicted_Price),
    Percentage_Error = (Absolute_Error / Actual_Price) * 100
  ) %>%
  arrange(desc(Absolute_Error)) %>%
  head(5) %>%
  select(location, Area_sqft, Bathroom_num, Actual_Price, Predicted_Price, Percentage_Error)


knitr::kable(error_data, caption = "Deep Dive into Top 5 Prediction Errors")
Deep Dive into Top 5 Prediction Errors
location Area_sqft Bathroom_num Actual_Price Predicted_Price Percentage_Error
Other 20000 10 8000 461.13248 94.23584
gurgaon 5500 5 5500 1096.29137 80.06743
Other 2200 1 2700 66.10883 97.55152
pune 1720 4 2700 194.95301 92.77952
gurgaon 4800 5 2700 622.57867 76.94153

The largest prediction errors are concentrated among atypical properties, particularly very large units located in high-end neighborhoods. In these cases, housing prices are driven by factors beyond structural characteristics, such as project branding, luxury positioning, school districts, and neighborhood prestige, which are not captured in the current dataset.

As a result, the model tends to under-estimate high-end properties by extrapolating from mid-market patterns. This behavior reflects a limitation of the available features rather than model misspecification, and highlights the existence of distinct pricing regimes within the housing market.

From a practical perspective, the model is best interpreted as a mid-market pricing tool rather than a luxury housing valuation system.

5.3.3 Results and Discussion

The Random Forest model substantially outperforms Linear Regression in both RMSE and , indicating strong non-linear effects in housing price formation. All RMSE values are reported on the log-transformed price scale and therefore reflect relative rather than absolute price deviations.

imp_reg <- data.frame(
Feature = names(rf_model$variable.importance),
Importance = rf_model$variable.importance
)

ggplot(imp_reg, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Feature Importance: Price Regression")

The feature importance analysis indicates that Area_sqft is by far the most influential predictor of housing prices, substantially outweighing all other variables. This highlights that property size is the primary driver of price variation in the dataset. Bathroom_num ranks as the second most important feature, suggesting that internal layout and functional capacity also play a critical role in determining property value. In contrast, location shows a moderate level of importance, indicating that while neighborhood effects exist, they are secondary to structural characteristics in this dataset. Floor-related variables (Total_Floors and Current_Floor) contribute relatively little to price prediction, implying that vertical positioning has limited impact once size and layout are accounted for.

# Interaction Visualization (Area vs. Location)
df_model %>%
  filter(location != "Other") %>%
  ggplot(aes(x = Area_sqft, y = Log_Price, color = location)) +
  geom_smooth(method = "lm", se = FALSE, size = 1, alpha = 0.6) +
  theme_minimal() +
  labs(
    title = "Price Elasticity of Area Across Locations",
    subtitle = "Slope variation indicates different premiums for additional space",
    x = "Area (sqft)",
    y = "Log Price",
    color = "Top Locations"
  ) +
  theme(legend.position = "bottom")

The interaction plot reveals that the relationship between property size (Area_sqft) and price is not uniform across all neighborhoods. The varying slopes of the regression lines indicate that the “price premium” for an additional square foot is significantly higher in certain prime locations compared to others. This suggests that in high-demand areas, space is a luxury good with higher marginal value, whereas in emerging areas, the price scales more linearly with size.

5.3.4 Size-Segmented Regression

To further investigate whether prediction performance varies across property size ranges, we conduct a size-segmented regression analysis. Properties are divided into mainstream and large-size segments using a 2,500 sqft threshold, and model performance and error behavior are compared across segments to assess heterogeneity in price formation and prediction uncertainty.

quantile(df_model$Area_sqft, probs = c(0.8, 0.85, 0.9, 0.95))
##  80%  85%  90%  95% 
## 1680 1800 1950 2500

The area distribution shows that 2,500 sqft corresponds approximately to the 95th percentile of property size, indicating that properties above this threshold represent a small but distinct upper-tail segment of the market. This empirical cutoff supports the use of 2,500 sqft as a boundary between mainstream and large-size properties.

df_error <- test_data %>%
  mutate(
    Predicted = expm1(rf_pred),
    Actual = expm1(Log_Price),
    Abs_Error = abs(Actual - Predicted)
  ) %>%
  mutate(Size_Group = ifelse(Area_sqft <= 2500, "≤ 2500 sqft", "> 2500 sqft")) %>%
  group_by(Size_Group) %>%
  summarise(
    Mean_Abs_Error = mean(Abs_Error),
    Median_Abs_Error = median(Abs_Error)
  )

df_error
## # A tibble: 2 × 3
##   Size_Group  Mean_Abs_Error Median_Abs_Error
##   <chr>                <dbl>            <dbl>
## 1 > 2500 sqft           58.7            0.860
## 2 ≤ 2500 sqft           10.2            1.16

Error statistics further confirm this pattern: large-size properties (> 2,500 sqft) exhibit significantly higher mean absolute prediction errors compared to mainstream properties, while median errors remain relatively small. This suggests that a subset of extreme luxury properties drives most of the large errors.

# Error vs Area for threshold justification
df_error_raw <- test_data %>%
  mutate(
    Predicted = rf_pred,
    Abs_Error = abs(Log_Price - Predicted)
  )

ggplot(df_error_raw, aes(x = Area_sqft, y = Abs_Error)) +
  geom_point(alpha = 0.4) +
  geom_vline(xintercept = 2500, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(
    title = "Prediction Error vs Property Size",
    x = "Area (sqft)",
    y = "Absolute Error (log scale)"
  )

Prediction errors increase in dispersion as property size grows. While most large-size properties remain reasonably predicted, a small number of extreme luxury units exhibit disproportionately large errors, inflating the mean error but leaving the median relatively stable.

#Split dataset by property size

df_mainstream <- df_model %>% filter(Area_sqft <= 2500)
df_large <- df_model %>% filter(Area_sqft > 2500)

#Train-test split for mainstream segment

set.seed(123)
idx_main <- createDataPartition(df_mainstream$Log_Price, p = 0.8, list = FALSE)
train_main <- df_mainstream[idx_main, ]
test_main <- df_mainstream[-idx_main, ]

#Train-test split for large-size segment

idx_large <- createDataPartition(df_large$Log_Price, p = 0.8, list = FALSE)
train_large <- df_large[idx_large, ]
test_large <- df_large[-idx_large, ]

#Random Forest for mainstream properties

rf_main <- ranger(
Log_Price ~ Area_sqft + Bathroom_num + Current_Floor + Total_Floors + location,
data = train_main,
num.trees = 100,
importance = "impurity"
)

pred_main <- predict(rf_main, test_main)$predictions
eval_main <- postResample(pred_main, test_main$Log_Price)

#Random Forest for large-size properties

rf_large <- ranger(
Log_Price ~ Area_sqft + Bathroom_num + Current_Floor + Total_Floors + location,
data = train_large,
num.trees = 100,
importance = "impurity"
)

pred_large <- predict(rf_large, test_large)$predictions
eval_large <- postResample(pred_large, test_large$Log_Price)

#Performance comparison

segmented_results <- data.frame(
Segment = c("Mainstream (≤ 2500 sqft)", "Large-size (> 2500 sqft)"),
RMSE = c(eval_main["RMSE"], eval_large["RMSE"]),
R2 = c(eval_main["Rsquared"], eval_large["Rsquared"])
)

segmented_results
##                    Segment      RMSE        R2
## 1 Mainstream (≤ 2500 sqft) 0.2293624 0.9062465
## 2 Large-size (> 2500 sqft) 0.2858889 0.8155836
#Feature importance for mainstream segment

imp_main <- data.frame(
Feature = names(rf_main$variable.importance),
Importance = rf_main$variable.importance,
Segment = "Mainstream"
)

#Feature importance for large-size segment

imp_large <- data.frame(
Feature = names(rf_large$variable.importance),
Importance = rf_large$variable.importance,
Segment = "Large-size"
)

imp_combined <- bind_rows(imp_main, imp_large)

ggplot(imp_combined, aes(x = reorder(Feature, Importance), y = Importance, fill = Segment)) +
geom_col(position = "dodge") +
coord_flip() +
theme_minimal() +
labs(
title = "Feature Importance Comparison Across Size Segments",
x = "Feature",
y = "Importance (Node Impurity)"
)

While the segmented model maintains strong predictive performance for mainstream properties, performance deteriorates in the large-size segment, with higher RMSE and lower R². This confirms that pricing mechanisms in the luxury segment are more heterogeneous and less well explained by structural features alone.

5.4 Classification: Predicting Transaction Type

5.4.1 Objective

The objective of this classification task is to examine whether transaction type (new property vs resale) can be predicted solely based on structural and locational characteristics. By doing so, we aim to assess whether new and resale properties exhibit systematic differences in their physical configurations and spatial distributions.

5.4.2 Models and Evaluation

A Decision Tree is used as a baseline and compared against a Random Forest Classifier. Performance is evaluated using accuracy and confusion matrices.

library(rpart)

# Decision Tree
dt_model <- rpart(Transaction ~ Area_sqft + Bathroom_num +
Current_Floor + Total_Floors + location,
data = train_data, method = "class")

dt_pred <- predict(dt_model, test_data, type = "class")
dt_acc  <- mean(dt_pred == test_data$Transaction)

# Random Forest Classification
rf_cls <- randomForest(Transaction ~ Area_sqft + Bathroom_num +
Current_Floor + Total_Floors + location,
data = train_data, ntree = 100)

rf_cls_pred <- predict(rf_cls, test_data)
rf_acc <- mean(rf_cls_pred == test_data$Transaction)

cls_results <- data.frame(
Model = c("Decision Tree", "Random Forest"),
Accuracy = c(dt_acc, rf_acc)
)
cls_results
##           Model  Accuracy
## 1 Decision Tree 0.8558553
## 2 Random Forest 0.9289165
confusionMatrix(rf_cls_pred, test_data$Transaction)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     New Property Resale
##   New Property         6337    583
##   Resale               1933  26542
##                                           
##                Accuracy : 0.9289          
##                  95% CI : (0.9262, 0.9316)
##     No Information Rate : 0.7664          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7896          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7663          
##             Specificity : 0.9785          
##          Pos Pred Value : 0.9158          
##          Neg Pred Value : 0.9321          
##              Prevalence : 0.2336          
##          Detection Rate : 0.1790          
##    Detection Prevalence : 0.1955          
##       Balanced Accuracy : 0.8724          
##                                           
##        'Positive' Class : New Property    
## 

5.4.3 Results and Discussion

The Random Forest classifier achieves a classification accuracy of 92.66%, substantially outperforming the baseline Decision Tree model. This accuracy is also significantly higher than the no-information rate, indicating that the model provides meaningful predictive value beyond majority-class guessing.

Although the dataset exhibits class imbalance favoring resale properties, the Random Forest classifier achieves strong balanced accuracy and Kappa statistics, indicating robust performance across both classes.

The confusion matrix reveals that the model is particularly effective at identifying resale properties, with very high specificity.Although sensitivity for new properties is comparatively lower, the balanced accuracy remains high, suggesting a reasonable trade-off between class-level performance.The Kappa statistic indicates strong agreement beyond chance, further confirming the robustness of the classification results.

Overall, these findings suggest that transaction type is influenced by complex interactions among structural and locational features, which are well captured by ensemble learning methods.This confirms that transaction status is not purely an administrative label, but reflects underlying differences in property design and location, which can be effectively identified through data-driven classification.

5.5 Final Conclusion

This study addresses two key research questions:

  1. Can housing prices be accurately predicted using structural and locational attributes?

  2. Can transaction type (new property vs resale) be effectively classified using the same feature set?

The regression analysis demonstrates that housing prices can be predicted with high accuracy, particularly when non-linear models such as Random Forest are employed.Property size and internal configuration emerge as the most influential determinants of price, with location providing additional explanatory power.

The classification analysis further shows that transaction type can be reliably distinguished based on structural and spatial characteristics. The Random Forest classifier achieves strong overall performance and significantly outperforms baseline models, indicating that meaningful patterns exist beyond simple heuristics.

Taken together, the results confirm that a unified feature framework can successfully support both regression and classification tasks, providing a comprehensive understanding of the housing market from both pricing and transaction perspectives.

5.6 Limitations

Despite the strong overall performance of the proposed models, several limitations should be acknowledged.

First, the size-segmented regression analysis indicates that pricing mechanisms differ substantially between mainstream and large-size (luxury) properties. While structural attributes such as area and bathroom count explain most of the price variation in the mainstream segment, luxury property prices are influenced by additional factors that are not captured in the dataset. These include developer reputation, neighborhood prestige, scenic views, privacy, and interior customization, all of which may play a decisive role in high-end housing markets. This limitation directly motivates the size-segmented analysis presented in Section 5.3.4, which partially addresses—but does not fully eliminate—these structural differences.

Second, although log-transformation and robust modeling techniques mitigate the influence of extreme values, prediction uncertainty remains higher for large-size properties due to their inherent heterogeneity and limited comparability. As a result, even segment-specific models may exhibit higher variance in the upper tail of the price distribution.

Finally, the classification task focuses only on the two dominant transaction categories (new property and resale). While this choice is justified by sample size considerations, more granular transaction types and temporal dynamics could be explored in future work if richer data become available.

6.Conclusion & Discussion

6.1 Overall Conclusions

This study examined the main factors affecting residential housing prices and compared different modelling approaches for price prediction. The results show that housing prices are mainly influenced by location and property size. While structural features such as bathrooms also contribute to distinguishing transaction types, their impact on price remains secondary compared to location and area-related variables.

Exploratory Data Analysis indicated that housing prices vary significantly across locations and that the price distribution is highly skewed, with most properties concentrated at lower price levels and a small number of extremely high-priced outliers. As a result, the log transform was applied in the regression modelling stage to improve model stability. Both regression and classification results consistently highlight the importance of location and area-related features, indicating their strong relevance to housing prices and market segmentation.

In addition, the Random Forest regression model demonstrated strong predictive performance, achieving a test R² value of 0.920. The classification model also achieved high accuracy in distinguishing between new and resale properties. This suggests that property characteristics are useful not only for price prediction but also for identifying different market segments. Overall, the findings indicate that housing prices are determined by a combination of location-related factors and basic physical attributes of the property.

6.2 Business and Practical Implications

The findings have practical value for stakeholders such as buyers, sellers, developers, and financial institutions. Since location and property size have the strongest influence on housing prices, these factors should be prioritised in pricing decisions and property valuation.

In addition to linear regression, a Random Forest model was used to capture non-linear relationships in the data and demonstrated better performance. This suggests that residential pricing involves complex interactions between variables that may not be fully captured by linear models. Therefore, machine learning approaches are more suitable for prediction-oriented tasks, particularly for large datasets.

The robust R-squared values observed reflect the high explanatory power of area-related variables and the large sample size, highlighting the need for careful feature selection to avoid information leakage. In practice, while Random Forest supports accurate prediction, understanding key price drivers remains essential for meaningful market analysis.

6.3 Limitations and Future Improvements

This study has several limitations. First, the dataset does not include time-related variables, which limits the analysis of price changes over time. Second, detailed location information such as geographic coordinates and neighbourhood characteristics is not available, which may reduce the model’s explanatory power.

Future studies could include time-series data to capture market trends and add more detailed location features to improve model performance. Finally, to mitigate the risk of information leakage, original price-related variables were intentionally excluded during model training. This helps ensure that the reported model performance reflects genuine predictive ability.