Group Members: Chen HuiLin (23113750), Qiu YiMei (23052209), Wang Shibo (23111363), Lyu YangYang (24059665), Wang Kexin (24064883)
1 Introduction
The real estate market is not only directly
related to the healthy development of the country’s economy, but also
closely related to people’s lives. Nowadays, with the process of
urbanization, the proportion of the second-hand housing market is
gradually increasing. The second-hand housing market can reflect the
supply and demand relationship of the real estate market. This project
will use Perth, Australia as an example to carry out research on
second-hand housing prices. Data analysis. Aimed at providing data
guidance to market suppliers and consumers.
1.1 Objectives
(1) Predicting second-hand house prices in Perth
based on machine learning algorithms.
(2) To classify Perth
second-hand house into low-priced, medium-priced, high-priced category
based on various factors with classification algorithm.
2 Methods and Analysis
2.1 Data Preparation
The dataset was
obtained from https://www.kaggle.com/datasets/syuzai/perth-house-prices.
This data was scraped from http://house.speakingsame.com/ and includes data from
322 Perth suburbs, resulting in an average of about 100 rows per
suburb.
2.2 Data Exploration
Load data
head(all_perth_df)
## ADDRESS SUBURB PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA
## 1 1 Acorn Place South Lake 565000 4 2 2 600
## 2 1 Addis Way Wandi 365000 3 2 2 351
## 3 1 Ainsley Court Camillo 287000 3 1 1 719
## 4 1 Albert Street Bellevue 255000 2 1 2 651
## 5 1 Aman Place Lockridge 325000 4 1 2 466
## 6 1 Amethyst Crescent Mount Richon 409000 4 2 1 759
## FLOOR_AREA BUILD_YEAR CBD_DIST NEAREST_STN NEAREST_STN_DIST
## 1 160 2003 18300 Cockburn Central Station 1800
## 2 139 2013 26900 Kwinana Station 4900
## 3 86 1979 22600 Challis Station 1900
## 4 59 1953 17900 Midland Station 3600
## 5 131 1998 11200 Bassendean Station 2000
## 6 118 1991 27300 Armadale Station 1000
## DATE_SOLD POSTCODE LATITUDE LONGITUDE NEAREST_SCH
## 1 09-2018 6164 -32.11590 115.8424 LAKELAND SENIOR HIGH SCHOOL
## 2 02-2019 6167 -32.19347 115.8596 ATWELL COLLEGE
## 3 06-2015 6111 -32.12058 115.9936 KELMSCOTT SENIOR HIGH SCHOOL
## 4 07-2018 6056 -31.90055 116.0380 SWAN VIEW SENIOR HIGH SCHOOL
## 5 11-2016 6054 -31.88579 115.9478 KIARA COLLEGE
## 6 03-2013 6112 -32.15380 116.0237 ARMADALE SENIOR HIGH SCHOOL
## NEAREST_SCH_DIST NEAREST_SCH_RANK
## 1 0.8283386 NA
## 2 5.5243244 129
## 3 1.6491782 113
## 4 1.5714009 NA
## 5 1.5149216 NA
## 6 1.2272192 NA
colnames(all_perth_df)
## [1] "ADDRESS" "SUBURB" "PRICE" "BEDROOMS"
## [5] "BATHROOMS" "GARAGE" "LAND_AREA" "FLOOR_AREA"
## [9] "BUILD_YEAR" "CBD_DIST" "NEAREST_STN" "NEAREST_STN_DIST"
## [13] "DATE_SOLD" "POSTCODE" "LATITUDE" "LONGITUDE"
## [17] "NEAREST_SCH" "NEAREST_SCH_DIST" "NEAREST_SCH_RANK"
str(all_perth_df)
## 'data.frame': 33656 obs. of 19 variables:
## $ ADDRESS : chr "1 Acorn Place" "1 Addis Way" "1 Ainsley Court" "1 Albert Street" ...
## $ SUBURB : chr "South Lake" "Wandi" "Camillo" "Bellevue" ...
## $ PRICE : int 565000 365000 287000 255000 325000 409000 400000 370000 565000 685000 ...
## $ BEDROOMS : int 4 3 3 2 4 4 3 4 4 3 ...
## $ BATHROOMS : int 2 2 1 1 1 2 2 2 2 2 ...
## $ GARAGE : chr "2" "2" "1" "2" ...
## $ LAND_AREA : int 600 351 719 651 466 759 386 468 875 552 ...
## $ FLOOR_AREA : int 160 139 86 59 131 118 132 158 168 126 ...
## $ BUILD_YEAR : chr "2003" "2013" "1979" "1953" ...
## $ CBD_DIST : int 18300 26900 22600 17900 11200 27300 28200 41700 12100 5900 ...
## $ NEAREST_STN : chr "Cockburn Central Station" "Kwinana Station" "Challis Station" "Midland Station" ...
## $ NEAREST_STN_DIST: int 1800 4900 1900 3600 2000 1000 3700 1100 2500 508 ...
## $ DATE_SOLD : chr "09-2018" "02-2019" "06-2015" "07-2018" ...
## $ POSTCODE : int 6164 6167 6111 6056 6054 6112 6112 6169 6022 6053 ...
## $ LATITUDE : num -32.1 -32.2 -32.1 -31.9 -31.9 ...
## $ LONGITUDE : num 116 116 116 116 116 ...
## $ NEAREST_SCH : chr "LAKELAND SENIOR HIGH SCHOOL" "ATWELL COLLEGE" "KELMSCOTT SENIOR HIGH SCHOOL" "SWAN VIEW SENIOR HIGH SCHOOL" ...
## $ NEAREST_SCH_DIST: num 0.828 5.524 1.649 1.571 1.515 ...
## $ NEAREST_SCH_RANK: int NA 129 113 NA NA NA NA NA NA 29 ...
The dataset contains a total of 33,656 entries.
The features
include:ADDRESS,SUBURB,PRICE,BEDROOMS,BATHROOMS,GARAGE,LAND_AREA,FLOOR_AREA,BUILD_YEAR,
CBD_DIST, NEAREST_STN, NEAREST_STN_DIST, DATE_SOLD, POSTCODE, LATITUDE,
LONGITUDE, NEAREST_SCH, NEAREST_SCH_DIST, NEAREST_SCH_RANK.
Address:
The specific location of the property.
- Convert the address into
latitude (LATITUDE) and longitude (LONGITUDE) to provide geospatial
information for analysis.
SUBURB: The neighborhood where the property is located.
Price: The actual selling price of the property.
- Remove this
feature in the test set, as price is the target variable and should not
be directly used by the model.
Bedrooms: The number of bedrooms in the property.
- The number
of bedrooms directly affects the functionality and value of the
property, making it a key predictor of price.
Bathrooms: The number of bathrooms in the property.
- The number
of bathrooms relates to the comfort and usability of the house,
influencing its price.
Garage: The number of garages associated with the property.
-
The number of garages reflects parking convenience and adds value to the
property, impacting its price.
Floor Area: The indoor usable area of the property (in square
meters).
- Floor area is a critical determinant of property value
and is usually positively correlated with price.
LAND_AREA: The shared area of the community where the property is
located.
- Common areas may influence the quality of community
amenities, indirectly affecting property prices.
BUILD_YEAR: The year the property was constructed, ranging from 1868
to 2017.
- Indicates the property’s age, which can be calculated
with the selling date. The age of a property directly affects its
condition and value.
CBD_DIST: The distance from the property to the nearest commercial
area (in kilometers).
- Properties closer to commercial areas are
generally more valuable due to better accessibility.
NEAREST_STN_DIST:The distance from the property to the nearest subway
station (in kilometers).
- Proximity to subway stations often
increases a property’s appeal, as convenient commuting adds
value.
DATE_SOLD: The date the property was sold, ranging from 1990 to 2020.
- Sale date captures the effect of time trends on property prices,
which tend to fluctuate over time.
NEAREST_SCH_DIST: The distance from the property to the nearest
school (in kilometers).
- Access to educational resources is a key
factor affecting property value, with properties closer to schools often
being more desirable.
NEAREST_SCH_RANK: The ranking of nearby schools.
summary(all_perth_df)
## ADDRESS SUBURB PRICE BEDROOMS
## Length:33656 Length:33656 Min. : 51000 Min. : 1.000
## Class :character Class :character 1st Qu.: 410000 1st Qu.: 3.000
## Mode :character Mode :character Median : 535500 Median : 4.000
## Mean : 637072 Mean : 3.659
## 3rd Qu.: 760000 3rd Qu.: 4.000
## Max. :2440000 Max. :10.000
##
## BATHROOMS GARAGE LAND_AREA FLOOR_AREA
## Min. : 1.000 Length:33656 Min. : 61 Min. : 1.0
## 1st Qu.: 1.000 Class :character 1st Qu.: 503 1st Qu.:130.0
## Median : 2.000 Mode :character Median : 682 Median :172.0
## Mean : 1.823 Mean : 2741 Mean :183.5
## 3rd Qu.: 2.000 3rd Qu.: 838 3rd Qu.:222.2
## Max. :16.000 Max. :999999 Max. :870.0
##
## BUILD_YEAR CBD_DIST NEAREST_STN NEAREST_STN_DIST
## Length:33656 Min. : 681 Length:33656 Min. : 46
## Class :character 1st Qu.:11200 Class :character 1st Qu.: 1800
## Mode :character Median :17500 Mode :character Median : 3200
## Mean :19777 Mean : 4523
## 3rd Qu.:26600 3rd Qu.: 5300
## Max. :59800 Max. :35500
##
## DATE_SOLD POSTCODE LATITUDE LONGITUDE
## Length:33656 Min. :6003 Min. :-32.47 Min. :115.6
## Class :character 1st Qu.:6050 1st Qu.:-32.07 1st Qu.:115.8
## Mode :character Median :6069 Median :-31.93 Median :115.9
## Mean :6089 Mean :-31.96 Mean :115.9
## 3rd Qu.:6150 3rd Qu.:-31.84 3rd Qu.:116.0
## Max. :6558 Max. :-31.46 Max. :116.3
##
## NEAREST_SCH NEAREST_SCH_DIST NEAREST_SCH_RANK
## Length:33656 Min. : 0.07091 Min. : 1.00
## Class :character 1st Qu.: 0.88057 1st Qu.: 39.00
## Mode :character Median : 1.34552 Median : 68.00
## Mean : 1.81527 Mean : 72.67
## 3rd Qu.: 2.09722 3rd Qu.:105.00
## Max. :23.25437 Max. :139.00
## NA's :10952
2.3 Data Cleaning
2.3.1 Duplicate values
Check if the dataset contains duplicate values.
Check for
duplicates (if have duplicates, return TRUE).
duplicates <- duplicated(all_perth_df)
any(duplicates)
## [1] FALSE
2.3.2 Missing values
Data transformation: Convert
BUILD_YEAR and DATE_SOLD to date
format.
# BUILD_YEAR transfer date
all_perth_df$BUILD_YEAR <- as.Date(paste(all_perth_df$BUILD_YEAR, "01-01", sep = "-"), format = "%Y-%m-%d")
# GARAGE NA
all_perth_df$GARAGE <- as.numeric(gsub("[^0-9]", NA, all_perth_df$GARAGE))
# DATE_SOLD transfer date
all_perth_df$DATE_SOLD <- as.Date(paste("01", all_perth_df$DATE_SOLD, sep = "-"), format = "%d-%m-%Y")
Check for missing values in each feature and calculate the number of
missing values for each feature.
null_counts <- colSums(is.na(all_perth_df))
null_counts
## ADDRESS SUBURB PRICE BEDROOMS
## 0 0 0 0
## BATHROOMS GARAGE LAND_AREA FLOOR_AREA
## 0 2478 0 0
## BUILD_YEAR CBD_DIST NEAREST_STN NEAREST_STN_DIST
## 3155 0 0 0
## DATE_SOLD POSTCODE LATITUDE LONGITUDE
## 0 0 0 0
## NEAREST_SCH NEAREST_SCH_DIST NEAREST_SCH_RANK
## 0 0 10952
GARAGE and
BUILD_YEAR features.all_perth_df <- all_perth_df[, !(colnames(all_perth_df) %in% 'NEAREST_SCH_RANK')]
all_perth_df <- all_perth_df[!is.na(all_perth_df$GARAGE) & !is.na(all_perth_df$BUILD_YEAR), ]
str(all_perth_df)
## 'data.frame': 28256 obs. of 18 variables:
## $ ADDRESS : chr "1 Acorn Place" "1 Addis Way" "1 Ainsley Court" "1 Albert Street" ...
## $ SUBURB : chr "South Lake" "Wandi" "Camillo" "Bellevue" ...
## $ PRICE : int 565000 365000 287000 255000 325000 409000 400000 370000 565000 685000 ...
## $ BEDROOMS : int 4 3 3 2 4 4 3 4 4 3 ...
## $ BATHROOMS : int 2 2 1 1 1 2 2 2 2 2 ...
## $ GARAGE : num 2 2 1 2 2 1 2 2 3 8 ...
## $ LAND_AREA : int 600 351 719 651 466 759 386 468 875 552 ...
## $ FLOOR_AREA : int 160 139 86 59 131 118 132 158 168 126 ...
## $ BUILD_YEAR : Date, format: "2003-01-01" "2013-01-01" ...
## $ CBD_DIST : int 18300 26900 22600 17900 11200 27300 28200 41700 12100 5900 ...
## $ NEAREST_STN : chr "Cockburn Central Station" "Kwinana Station" "Challis Station" "Midland Station" ...
## $ NEAREST_STN_DIST: int 1800 4900 1900 3600 2000 1000 3700 1100 2500 508 ...
## $ DATE_SOLD : Date, format: "2018-09-01" "2019-02-01" ...
## $ POSTCODE : int 6164 6167 6111 6056 6054 6112 6112 6169 6022 6053 ...
## $ LATITUDE : num -32.1 -32.2 -32.1 -31.9 -31.9 ...
## $ LONGITUDE : num 116 116 116 116 116 ...
## $ NEAREST_SCH : chr "LAKELAND SENIOR HIGH SCHOOL" "ATWELL COLLEGE" "KELMSCOTT SENIOR HIGH SCHOOL" "SWAN VIEW SENIOR HIGH SCHOOL" ...
## $ NEAREST_SCH_DIST: num 0.828 5.524 1.649 1.571 1.515 ...
The results show that GARAGE has 2,478 missing values,
BUILD_YEAR has 3,155 missing values, and
NEAREST_SCH_DIST has 10,952 missing values.
Due to the
excessive number of missing values in the school ranking, the column has
been deleted.
The missing values in the GARAGE and
BUILD_YEAR columns have been removed. After deletion, the
dataset now contains 28,256 entries.
2.3.3 Outliers
Check for outliers in the following features: number of bedrooms,
number of bathrooms, number of garages, floor area, common area, and the
three distance features.
Calculate the IQR (Interquartile Range) to
define the outlier range.
features_to_check <- c('BEDROOMS', 'LAND_AREA', 'FLOOR_AREA', 'CBD_DIST', 'NEAREST_STN_DIST', 'NEAREST_SCH_DIST','GARAGE')
par(mfrow = c(3, 3), mar = c(5, 5, 2, 1))
for (feature in features_to_check) {
boxplot(all_perth_df[[feature]], main = paste('Boxplot of', feature), col = 'lightblue', horizontal = TRUE,
xlab = feature, ylab = 'Value')
}
par(mfrow = c(1, 1))
Then delete all outliers
Add new feature columns
HOUSE_AGE (New Feature): The age of the
property at the time of sale, calculated as:
{HOUSE_AGE} = {Year of
Sale} - {Year Built}
- HOUSE_AGE reflects its condition, which
typically influences its price.
all_perth_df$SOLD_YEAR <- as.numeric(format(all_perth_df$DATE_SOLD, "%Y"))
all_perth_df$BUILD_YEAR <- as.numeric(format(all_perth_df$BUILD_YEAR, "%Y"))
# HOUSE_AGE
all_perth_df$HOUSE_AGE <- all_perth_df$SOLD_YEAR - all_perth_df$BUILD_YEAR
head(all_perth_df$HOUSE_AGE)
## [1] 15 65 18 2 6 15
3 EDA
features_to_plot <- c('PRICE', 'BEDROOMS', 'BATHROOMS', 'LAND_AREA', 'FLOOR_AREA', 'HOUSE_AGE')
missing_features <- setdiff(features_to_plot, colnames(data))
if (length(missing_features) > 0) {
stop("not:", paste(missing_features, collapse = ", "))
}
plots <- list()
for (feature in features_to_plot) {
if (!is.numeric(data[[feature]])) {
warning(paste("jump:", feature))
next
}
p <- ggplot(data, aes_string(x = feature)) +
geom_histogram(fill = "lightblue", alpha = 0.8, bins = 30) +
theme_classic() +
labs(
title = paste("Distribution of", feature),
x = feature,
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 8)
)
plots[[feature]] <- p
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
do.call(grid.arrange, c(plots, ncol = 2))
Distribution Characteristics:
- The distribution of house
age (HOUSE_AGE) is right-skewed, with most houses being 0-40 years old.
- Houses aged 10-20 years are the most common, indicating a
relatively high proportion of newly built homes.
- A small number
of houses are over 80 years old, with some even reaching 100 years,
possibly representing historical buildings or older homes that have not
been renovated or rebuilt.
Data Implications:
- The concentration of relatively low house
ages suggests significant real estate development in the area in recent
years.
- The small number of older houses may belong to specific
areas or have historical value.
Data Applications:
- House age could be a key feature for
predicting house prices, as newer houses are likely to have higher
values.
- Further analysis of the relationship between house age
and price is recommended to confirm its impact on property values.
suburb_count <- data %>%
group_by(SUBURB) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 20)
p <- ggplot(suburb_count, aes(x = reorder(SUBURB, -count), y = count)) +
geom_bar(stat = "identity", fill ="lightblue", alpha = 0.8) +
theme_classic() +
labs(
title = "Top 20 Suburbs by Frequency",
x = "SUBURB",
y = "Count"
) +
theme(
axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1, size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
print(p)
Data Description:
- The chart displays the top 20 regions with the highest number of
properties, ranked in descending order.
- Tapping and Iluka have
the most property records, each close to 15 properties.
Distribution Characteristics:
- The top two regions have
slightly higher property counts, while the rest are relatively evenly
distributed, ranging from 10 to 15 properties.
Insights:
- High-frequency regions may be key areas for real
estate development.
- Further analysis of house prices and property
characteristics in these regions is recommended.
# Relationship between floor area and house price
p <- ggplot(data, aes(x = FLOOR_AREA, y = PRICE)) +
geom_point(alpha = 0.5, color = "lightblue") +
theme_classic() +
labs(title = "Relationship between FLOOR_AREA and PRICE", x = "FLOOR_AREA", y = "PRICE") +
theme(plot.title = element_text(hjust = 0.5, size = 14))
# Relationship between number of bedrooms and house price
p <- ggplot(data, aes(x = factor(BEDROOMS), y = PRICE)) +
geom_boxplot(fill = "lightblue", outlier.color = "black") +
theme_classic() +
labs(title = "Relationship between BEDROOMS and PRICE", x = "BEDROOMS", y = "PRICE") +
theme(plot.title = element_text(hjust = 0.5, size = 14))
print(p)
Distribution Characteristics:
- House prices (PRICE) are
positively correlated with floor area (FLOOR_AREA); larger areas are
associated with higher prices.
- Most houses have an area between
50-200 square meters, with prices concentrated in the range of 500,000
to 1,000,000.
High-Priced House Analysis:
- A small number of houses are
priced above 1,500,000, potentially influenced by location or other
features.
Summary:
- There is a clear positive correlation between house
prices and floor area, which can be utilized for predictive
analysis.
- Further analysis is recommended for high-priced houses
and those with areas exceeding 200 square meters to explore their
characteristics.
# Distribution of house prices
leaflet(data) %>%
addTiles() %>%
addCircleMarkers(
lng = ~LONGITUDE, lat = ~LATITUDE,
radius = 5,
color = ~colorNumeric("YlOrRd", PRICE)(PRICE),
stroke = FALSE, fillOpacity = 0.7,
popup = ~paste("Price: $", round(PRICE, 2))
) %>%
saveWidget(file = file.path(output_dir, "leaflet_map.html"), selfcontained = TRUE)
avg_price_by_suburb <- data %>%
group_by(SUBURB) %>%
summarise(mean_price = mean(PRICE, na.rm = TRUE)) %>%
arrange(desc(mean_price)) %>%
# Top 20 regions by house price
slice_head(n = 20)
p <- ggplot(avg_price_by_suburb, aes(x = reorder(SUBURB, -mean_price), y = mean_price)) +
geom_bar(stat = "identity", fill = "lightblue") +
theme_classic() +
labs(title = "Top 20 Suburbs by Average Price", x = "SUBURB", y = "Average Price") +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5, size = 14)
)
print(p)
Trends:
- The greater the number of bedrooms (BEDROOMS),
the higher the median house price (PRICE).
Distribution:
- Houses with 3 and 4 bedrooms are the most
common.
- A significant number of high-priced houses are found
among those with 4 bedrooms.
Summary:
- The number of bedrooms is an important factor
influencing house prices.
- Further analysis is needed to explore
the characteristics of high-priced houses.
data <- data %>%
mutate(SOLD_YEAR = as.numeric(format(as.Date(DATE_SOLD, "%Y-%m-%d"), "%Y")))
p <- data %>%
group_by(SOLD_YEAR) %>%
summarise(mean_price = mean(PRICE, na.rm = TRUE)) %>%
ggplot(aes(x = SOLD_YEAR, y = mean_price)) +
geom_line(color = "steelblue") +
theme_classic() +
labs(title = "Average House Price Over Years", x = "Year", y = "Average Price") +
theme(plot.title = element_text(hjust = 0.5, size = 14))
print(p)
Trends:
House prices rose rapidly from 2005 to 2010, peaking
in 2010. After 2010, house prices significantly declined and gradually
stabilized.
Fluctuations:
After 2015, house prices experienced slight
fluctuations but remained stable at a higher level overall.
Summary:House prices may be influenced by factors such as policies and
the economic environment. It is recommended to conduct further analysis
on the driving factors behind the peak in 2010 and the stable period
after 2015.
data <- data %>%
mutate(PriceCategory = case_when(
PRICE < quantile(PRICE, 0.25) ~ "Low",
PRICE >= quantile(PRICE, 0.25) & PRICE < quantile(PRICE, 0.75) ~ "Medium",
TRUE ~ "High"
))
ggplot(data, aes(x = PriceCategory, y = PRICE)) +
geom_boxplot(fill = "lightblue", alpha = 0.8) +
scale_y_continuous(trans = 'log10') +
theme_classic() +
labs(title = "Price Distribution by Category", x = "Price Category", y = "Price (log scale)")
4 Feature Engineering
Use Leaflet to Create an Interactive Map
(The Relationship Between Latitude/Longitude Features and House
Prices)
leaflet(all_perth_df) %>%
addTiles() %>%
addCircleMarkers(lng = ~LONGITUDE, lat = ~LATITUDE,
radius = 5,
color = ~colorNumeric("YlOrRd", PRICE)(PRICE),
stroke = FALSE, fillOpacity = 0.7,
popup = ~paste("Price: $", round(PRICE, 2)))
It can be observed from the map that high housing prices are mainly
concentrated in the western coastal areas and areas along the Swan
Rive.
Heatmap:
features_to_check <- c('PRICE', 'BEDROOMS', 'BATHROOMS', 'LAND_AREA', 'FLOOR_AREA', 'CBD_DIST', 'NEAREST_STN_DIST', 'NEAREST_SCH_DIST', 'HOUSE_AGE','POSTCODE')
selected_data <- all_perth_df[, features_to_check]
# correlation_matrix
correlation_matrix <- cor(selected_data, use = "complete.obs")
cor_melted <- as.data.frame(as.table(correlation_matrix))
colnames(cor_melted) <- c("Var1", "Var2", "value")
#heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name = "Correlation") +
#geom_text(aes(label = round(value, 3)), color = "black", size = 4) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap of Features",
x = "Features",
y = "Features")
PRICE has a strong correlation with
FLOOR_AREA, BEDROOMS, and
BATHROOMS. The larger the floor area, the higher the house
price tends to be.
There is also a strong positive correlation
between BEDROOMS and BATHROOMS; typically,
houses with more bedrooms also have more bathrooms.
4 Modelling
4.1 House Price Classification In this task, we aim
to develop machine learning models to classify Perth’s second-hand
houses into different price levels: low price, medium price, and high
price. This classification allows us to better understand the
characteristics of houses in each category and provide insights for
buyers and real estate agents. We use Random Forest and Logistic
Regression models for this task.
# Load necessary library
# Set seed for reproducibility
set.seed(123)
# Create train-test split
trainIndex <- createDataPartition(all_perth_df$PRICE, p = 0.8, list = FALSE)
trainData <- all_perth_df[trainIndex, ]
testData <- all_perth_df[-trainIndex, ]
# Create consistent breaks from trainData
price_breaks <- quantile(trainData$PRICE, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE)
# Add Price_Category to trainData
trainData$Price_Category <- cut(
trainData$PRICE,
breaks = price_breaks,
labels = c("Low", "Medium", "High"),
include.lowest = TRUE
)
# Add Price_Category to testData
testData$Price_Category <- cut(
testData$PRICE,
breaks = price_breaks,
labels = c("Low", "Medium", "High"),
include.lowest = TRUE
)
# Train Random Forest model
# Load necessary library
# Train Random Forest model
set.seed(123)
rf_model <- randomForest(
Price_Category ~ BEDROOMS + BATHROOMS + GARAGE + LAND_AREA + FLOOR_AREA + BUILD_YEAR + CBD_DIST + NEAREST_STN_DIST + LATITUDE + LONGITUDE + NEAREST_SCH_DIST + SOLD_YEAR + HOUSE_AGE,
data = trainData,
importance = TRUE,
ntree = 100
)
# Predict on test data
testData$Predicted_Category_RF <- predict(rf_model, newdata = testData)
# Evaluate the Prediction Model
conf_matrix_rf <- confusionMatrix(testData$Predicted_Category_RF, testData$Price_Category)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 775 123 5
## Medium 150 727 100
## High 7 98 836
##
## Overall Statistics
##
## Accuracy : 0.8288
## 95% CI : (0.8144, 0.8425)
## No Information Rate : 0.3361
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7431
##
## Mcnemar's Test P-Value : 0.388
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.8315 0.7669 0.8884
## Specificity 0.9322 0.8665 0.9441
## Pos Pred Value 0.8583 0.7441 0.8884
## Neg Pred Value 0.9181 0.8802 0.9441
## Prevalence 0.3304 0.3361 0.3336
## Detection Rate 0.2747 0.2577 0.2963
## Detection Prevalence 0.3201 0.3463 0.3336
## Balanced Accuracy 0.8819 0.8167 0.9163
The model achieved an overall accuracy of 82.88% with a 95% confidence interval of 81.44% to 84.25%, demonstrating strong predictive performance compared to a no-information rate of 33.61%. The kappa statistic of 0.7431 indicates substantial agreement between predictions and actual values. Class-wise performance metrics show that the model has high sensitivity for the “High” price category (88.84%) and reasonable sensitivity for “Low” (83.15%) and “Medium” (76.69%) categories. Specificity was consistently high across all categories, ranging from 86.65% for “Medium” to 94.41% for “High.” The balanced accuracy, which considers both sensitivity and specificity, was also strong, particularly for “High” (91.63%) and “Low” (88.19%). Overall, the model effectively distinguishes between the three price categories.
# Feature Importance Bar Chart
importance_rf <- as.data.frame(importance(rf_model))
importance_rf$Feature <- rownames(importance_rf)
importance_rf <- importance_rf[order(-importance_rf$MeanDecreaseGini), ]
ggplot(importance_rf, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(
title = "Feature Importance: Random Forest",
x = "Feature",
y = "Mean Decrease in Gini Index"
)
The feature importance bar chart identifies the key predictors in the
Random Forest model based on their contribution to reducing the Gini
index. The features at the top of the chart, such as
FLOOR_AREA, CBD_DIST, and
LONGITUDE, are the most influential in determining the
model’s classifications.
# Train Logistic Regression model
# Load necessary library
# Train Logistic Regression model
logit_model <- multinom(
Price_Category ~ BEDROOMS + BATHROOMS + GARAGE + LAND_AREA + FLOOR_AREA + CBD_DIST + NEAREST_STN_DIST + NEAREST_SCH_DIST + HOUSE_AGE,
data = trainData,
trace = FALSE
)
# Predict on test data
testData$Predicted_Category_Logit <- predict(logit_model, newdata = testData)
# Evaluate the Prediction Model
conf_matrix_logit <- confusionMatrix(testData$Predicted_Category_Logit, testData$Price_Category)
print(conf_matrix_logit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 709 195 30
## Medium 190 571 202
## High 33 182 709
##
## Overall Statistics
##
## Accuracy : 0.7051
## 95% CI : (0.6879, 0.7219)
## No Information Rate : 0.3361
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5576
##
## Mcnemar's Test P-Value : 0.7412
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.7607 0.6023 0.7535
## Specificity 0.8809 0.7907 0.8856
## Pos Pred Value 0.7591 0.5929 0.7673
## Neg Pred Value 0.8818 0.7971 0.8777
## Prevalence 0.3304 0.3361 0.3336
## Detection Rate 0.2513 0.2024 0.2513
## Detection Prevalence 0.3311 0.3414 0.3275
## Balanced Accuracy 0.8208 0.6965 0.8195
The model achieved an overall accuracy of 70.51% with a 95% confidence interval of 68.79% to 72.19%, significantly outperforming the no-information rate of 33.61%. The kappa statistic of 0.5576 indicates moderate agreement between predictions and actual values. Class-wise performance shows good sensitivity for “Low” (76.07%) and “High” (75.35%) categories but relatively lower sensitivity for the “Medium” category (60.23%). Specificity values are high across all categories, ranging from 79.07% for “Medium” to 88.56% for “High.” The balanced accuracy, which averages sensitivity and specificity, is strong for “Low” (82.08%) and “High” (81.95%) but lower for “Medium” (69.65%).
Random Forest model’s performance: High category: Strong performance with most samples correctly classified. Medium category: Good performance, still have some misclassification as low or high. Low category: Good performance, still have significant misclassification into Medium
Logistic regression model’s performance: High category: medium performance, with most instances misclassified as Medium (202 instances). Medium category: poor performance and more misclassified as Low or High Low category: medium performance and significant misclassification into Medium (709 instances), requiring improvement.
The Random Forest model performs well for the “High” category, with accurate predictions. However, there is overlap between “Low” and “Medium” categories, indicating potential misclassification. Similarly, the Logistic Regression model shows concentrated accuracy in the “High” category but struggles to distinguish between “Low” and “Medium,” with notable misclassification.
# Partial Dependence Plot for FLOOR_AREA
pdp_floor <- partial(rf_model, pred.var = "FLOOR_AREA", train = trainData)
plot_pdp_floor <- autoplot(pdp_floor, contour = FALSE) +
ggtitle("Partial Dependence: FLOOR_AREA") +
theme_minimal()
# Partial Dependence Plot for LAND_AREA
pdp_land <- partial(rf_model, pred.var = "LAND_AREA", train = trainData)
plot_pdp_land <- autoplot(pdp_land, contour = FALSE) +
ggtitle("Partial Dependence: LAND_AREA") +
theme_minimal()
# Combine the plots side by side
plot_pdp_floor + plot_pdp_land
The Dependence Plot for FLOOR_AREA shows that increasing floor area initially boosts property prices, but beyond a certain point, the effect diminishes and turns negative. The Dependence Plot for LAND_AREA indicates that larger land areas initially raise property values, but after peaking, their impact weakens and declines.
4.2 House Price Prediction
In this task, we aim to predict the
actual prices of Perth’s second-hand houses. Price prediction provides
essential insights for valuing properties accurately and is critical for
market analysis. We use LightGBM and XGBoost, two efficient
gradient-boosting models, to predict house prices based on features such
as FLOOR_AREA, LAND_AREA, CBD_DIST, and others.
# Load necessary library
# Set seed for reproducibility
set.seed(123)
# Create train-test split
trainIndex1 <- createDataPartition(all_perth_df$PRICE, p = 0.8, list = FALSE)
trainData1 <- all_perth_df[trainIndex, ]
testData1 <- all_perth_df[-trainIndex, ]
# Train LightGBM model
# Load necessary library
# Train LightGBM model
lgb_model <- lgb.train(
params = list(objective = "regression", metric = "rmse", seed = 42),
data = lgb.Dataset(data = as.matrix(trainData1 %>% select(-PRICE)), label = trainData1$PRICE),
nrounds = 100,
valids = list(train = lgb.Dataset(data = as.matrix(trainData1 %>% select(-PRICE)), label = trainData1$PRICE)),
early_stopping_rounds = 10,
verbose = -1
)
# Predict on test data
testData1$Predicted_Price_LGB <- predict(lgb_model, as.matrix(testData1 %>% select(-PRICE)))
# Evaluate the Prediction Model
evaluate_model <- function(actual, predicted) {
rmse_value <- sqrt(mean((actual - predicted)^2))
mae_value <- mean(abs(actual - predicted))
r2_value <- 1 - (sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
list(RMSE = rmse_value, MAE = mae_value, R2 = r2_value)
}
results_lgb <- evaluate_model(testData1$PRICE, testData1$Predicted_Price_LGB)
print(results_lgb)
## $RMSE
## [1] 124339.9
##
## $MAE
## [1] 77274.79
##
## $R2
## [1] 0.8695867
The LightGBM regression model achieved solid performance, with an RMSE of 124,339.9, MAE of 77,274.79, and an R² of 0.8696, indicating a strong ability to explain the variance in housing prices. The model showed consistent improvement during training, reducing RMSE from 322,728 to 93,306.2 over 100 iterations. These results highlight the model’s effectiveness in capturing key patterns in the data, though additional tuning or feature refinement may further enhance accuracy.
# Train XGBoost model
# Load necessary library
# Define convert_to_numeric function
convert_to_numeric <- function(df) {
df[] <- lapply(df, function(col) {
if (is.factor(col) || is.character(col)) {
as.numeric(as.factor(col))
} else {
col
}
})
return(df)
}
# Convert train and test datasets to numeric
train_numeric <- convert_to_numeric(trainData1)
test_numeric <- convert_to_numeric(testData1)
# Select features and convert to matrix
features <- train_numeric %>%
select(-PRICE) %>%
colnames()
train_matrix <- as.matrix(train_numeric[, features])
test_matrix <- as.matrix(test_numeric[, features])
# Train XGBoost model
# Train XGBoost model with print_every_n
xgb_model <- xgb.train(
params = list(objective = "reg:squarederror", max_depth = 6, eta = 0.3, subsample = 0.8),
data = xgb.DMatrix(data = train_matrix, label = train_numeric$PRICE),
nrounds = 100,
watchlist = list(train = xgb.DMatrix(data = train_matrix, label = train_numeric$PRICE)),
early_stopping_rounds = 10,
verbose = 0
)
# Predict on test data
test_numeric$Predicted_Price_XGB <- predict(xgb_model, xgb.DMatrix(data = test_matrix))
# Evaluate the Prediction Model
evaluate_model <- function(actual, predicted) {
rmse_value <- sqrt(mean((actual - predicted)^2))
mae_value <- mean(abs(actual - predicted))
r2_value <- 1 - (sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
list(RMSE = rmse_value, MAE = mae_value, R2 = r2_value)
}
results_xgb <- evaluate_model(test_numeric$PRICE, test_numeric$Predicted_Price_XGB)
print(results_xgb)
## $RMSE
## [1] 137848.4
##
## $MAE
## [1] 90908.12
##
## $R2
## [1] 0.8397108
The XGBoost model demonstrates its ability to predict house prices effectively, as indicated by its evaluation metrics. During training, the model’s RMSE steadily decreased from an initial value of 538,480 to 56,489 over 100 iterations, showcasing its learning and optimization process. When applied to test data, the model achieved an RMSE of 137,848.4, an MAE of 90,908.12, and an \(R^2\) value of 0.8397. These results highlight the model’s capability to minimize error and explain a significant proportion of the variance in house prices, confirming its suitability for the regression task.
The RMSE (Root Mean Square Error) for both two models decreasing steadily with each iteration, indicating consistent improvement in prediction accuracy during training. The LightGBM learning curve shows a steady RMSE decrease, with rapid error reduction initially and gradual convergence, indicating consistent accuracy improvement. Similarly, the XGBoost curve highlights sharp early RMSE declines, stabilizing over iterations, demonstrating effective learning and error minimization without overfitting.
The scatter plot shows predicted prices from LightGBM and XGBoost closely clustering around the diagonal, indicating strong and comparable accuracy, with slight deviations at higher price ranges. The histogram of residuals reveals both models’ errors are centered near zero, with LightGBM showing slightly tighter accuracy. The density plot highlights similar prediction trends between the models, with minor variations in spread, demonstrating comparable performance in property price forecasting.
# Combine evaluation metrics into a data frame
model_comparison <- data.frame(
Metric = c("RMSE", "MAE", "R2"),
LightGBM = c(results_lgb$RMSE, results_lgb$MAE, results_lgb$R2),
XGBoost = c(results_xgb$RMSE, results_xgb$MAE, results_xgb$R2)
)
# Convert to long format for ggplot2
model_comparison_long <- pivot_longer(model_comparison, cols = c("LightGBM", "XGBoost"), names_to = "Model", values_to = "Value")
# Plot the comparison
ggplot(model_comparison_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
theme_minimal() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Model Performance Comparison: LightGBM vs XGBoost",
y = "Value",
x = "Metric"
) +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) + # Keep axis text horizontal
geom_text(aes(label = round(Value, 2)), position = position_dodge(width = 0.7), vjust = -0.5, size = 3.5) + # Add values above bars
ylim(0, max(model_comparison_long$Value) * 1.2) # Add more padding to the y-axis
This bar chart compares the performance of LightGBM and XGBoost models using three evaluation metrics: RMSE, MAE, and R². LightGBM outperforms XGBoost in both RMSE (124,339.86 vs. 137,848.35) and MAE (77,274.79 vs. 90,908.12), indicating it provides better overall predictions. However, both models exhibit comparable R² scores (0.87 for LightGBM and 0.84 for XGBoost), highlighting similar capabilities in explaining variance. The chart visually emphasizes LightGBM’s marginally superior accuracy for this dataset.
The bar charts reveal feature importance for LightGBM and XGBoost in predicting property prices. Both models rank “Postcode,” “Floor Area,” and “CBD Distance” as the most critical factors, followed by “Longitude” and “Latitude,” emphasizing spatial influences. Features like “Bedrooms” has minimal impact, providing insights into key variables driving predictions.
5 Conclusion
Key Determinants of Housing Prices:
Features
such as FLOOR_AREA, LAND_AREA, CBD_DIST, and HOUSE_AGE play significant
roles in determining house prices.
Model:
1. The Random Forest
model outperforms Logistic Regression in classification tasks
2.
Gradient boosting models (LightGBM and XGBoost) provide robust
frameworks for predicting house prices
Actionable Insights for
Stakeholders:
1. Real estate agents can focus on key features such
as floor area, proximity to the CBD, and location for accurate
pricing.
2. Buyers can make informed decisions based on historical
trends and location-based analysis.
Reference
Mohd, T., Jamil, N.S., Johari, N., Abdullah, L.,
Masrom, S. (2020). An Overview of Real Estate Modelling Techniques for
House Price Prediction. Charting a Sustainable Future of ASEAN in
Business and Social Sciences. https://doi.org/10.1007/978-981-15-3859-9_28
Satish, G. N., Raghavendran, C. V., Rao, M. S., & Srinivasulu, C.
(2019). House price prediction using machine learning. Journal of
Innovative Technology and Exploring Engineering, 8(9), 717-722. DOI:10.35940/ijitee.I7849.078919
Zhang, Y., Huang,
J., Zhang, J., Liu, S., & Shorman, S. (2022). Analysis and
prediction of second-hand house price based on random forest. Applied
Mathematics and Nonlinear Sciences, 7(1), 27-42. https://doi.org/10.2478/amns.2022.1.00052
Couronné,
R., Probst, P. & Boulesteix, AL. (2018). Random forest versus
logistic regression: a large-scale benchmark experiment. BMC
Bioinformatics 19, 270. https://doi.org/10.1186/s12859-018-2264-5
Daghistani, T., & Alshammari, R. (2020). Comparison of statistical
logistic regression and random forest machine learning techniques in
predicting diabetes. Journal of Advances in Information Technology
Vol, 11(2), 78-83. doi: 10.12720/jait.11.2.78-83 Zhang,D., &
Gong,Y.(2020). The Comparison of LightGBM and XGBoost Coupling Factor
Analysis and Prediagnosis of Acute Liver Failure. IEEE Access, 8,
220990-221003. doi:10.1109/access.2020.3042848
Gray, D. (2013).
House Price Diffusion: An Application of Spectral Analysis to the Prices
of Irish Second-Hand Dwellings. Housing Studies, 28(6), 869–890. https://doi.org/10.1080/02673037.2013.768335
Feng,
M., Duan, Y., Wang, X., Zhang, J., & Ma, L. (2023). Carbon price
prediction based on decomposition technique and extreme gradient
boosting optimized by the grey wolf optimizer algorithm. Scientific
reports, 13(1), 18447. https://doi.org/10.1038/s41598-023-45524-2
C. R.
Madhuri, G. Anuradha and M. V. Pujitha. (2019). House Price Prediction
Using Regression Techniques: A Comparative Study. International
Conference on Smart Structures and Systems (ICSSS), Chennai, India,
2019, pp. 1-5, doi: 10.1109/ICSSS.2019.8882834.
Hu, Y. (2022). The
Impact of COVID-19 on Perth House Price: A Machine Learning Perspective.
Proceedings of the 2022 7th International Conference on Financial
Innovation and Economic Development (ICFIED 2022). https://doi.org/10.2991/aebmr.k.220307.203