Predicting Perth’s second-hand house prices based on ML algorithms

1 Introduction
The real estate market is directly tied to the healthy development of a country’s economy and closely connected to people’s daily lives. As urbanization progresses, second-hand housing accounts for a growing share of transactions, and the second-hand market reflects the supply and demand relationship of the wider real estate market. This project takes Perth, Australia as a case study and analyzes second-hand housing prices, with the aim of providing data-driven guidance to market suppliers and consumers.

1.1 Objectives
(1) Predicting second-hand house prices in Perth based on machine learning algorithms.
(2) Classifying Perth second-hand houses into low-priced, medium-priced, and high-priced categories based on various factors, using classification algorithms.

2 Methods and Analysis
2.1 Data Preparation
The dataset was obtained from https://www.kaggle.com/datasets/syuzai/perth-house-prices. This data was scraped from http://house.speakingsame.com/ and includes data from 322 Perth suburbs, resulting in an average of about 100 rows per suburb.

2.2 Data Exploration
Load data
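The code that reads the Kaggle CSV is not shown in the report; a minimal sketch, assuming the file name from the Kaggle download and that the file sits in the working directory:

# File name assumed from the Kaggle download; adjust the path to your local copy
all_perth_df <- read.csv("all_perth_310121.csv", stringsAsFactors = FALSE)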

head(all_perth_df)
##               ADDRESS       SUBURB  PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA
## 1       1 Acorn Place   South Lake 565000        4         2      2       600
## 2         1 Addis Way        Wandi 365000        3         2      2       351
## 3     1 Ainsley Court      Camillo 287000        3         1      1       719
## 4     1 Albert Street     Bellevue 255000        2         1      2       651
## 5        1 Aman Place    Lockridge 325000        4         1      2       466
## 6 1 Amethyst Crescent Mount Richon 409000        4         2      1       759
##   FLOOR_AREA BUILD_YEAR CBD_DIST              NEAREST_STN NEAREST_STN_DIST
## 1        160       2003    18300 Cockburn Central Station             1800
## 2        139       2013    26900          Kwinana Station             4900
## 3         86       1979    22600          Challis Station             1900
## 4         59       1953    17900          Midland Station             3600
## 5        131       1998    11200       Bassendean Station             2000
## 6        118       1991    27300         Armadale Station             1000
##   DATE_SOLD POSTCODE  LATITUDE LONGITUDE                  NEAREST_SCH
## 1   09-2018     6164 -32.11590  115.8424  LAKELAND SENIOR HIGH SCHOOL
## 2   02-2019     6167 -32.19347  115.8596               ATWELL COLLEGE
## 3   06-2015     6111 -32.12058  115.9936 KELMSCOTT SENIOR HIGH SCHOOL
## 4   07-2018     6056 -31.90055  116.0380 SWAN VIEW SENIOR HIGH SCHOOL
## 5   11-2016     6054 -31.88579  115.9478                KIARA COLLEGE
## 6   03-2013     6112 -32.15380  116.0237  ARMADALE SENIOR HIGH SCHOOL
##   NEAREST_SCH_DIST NEAREST_SCH_RANK
## 1        0.8283386               NA
## 2        5.5243244              129
## 3        1.6491782              113
## 4        1.5714009               NA
## 5        1.5149216               NA
## 6        1.2272192               NA
colnames(all_perth_df)
##  [1] "ADDRESS"          "SUBURB"           "PRICE"            "BEDROOMS"        
##  [5] "BATHROOMS"        "GARAGE"           "LAND_AREA"        "FLOOR_AREA"      
##  [9] "BUILD_YEAR"       "CBD_DIST"         "NEAREST_STN"      "NEAREST_STN_DIST"
## [13] "DATE_SOLD"        "POSTCODE"         "LATITUDE"         "LONGITUDE"       
## [17] "NEAREST_SCH"      "NEAREST_SCH_DIST" "NEAREST_SCH_RANK"
str(all_perth_df)
## 'data.frame':    33656 obs. of  19 variables:
##  $ ADDRESS         : chr  "1 Acorn Place" "1 Addis Way" "1 Ainsley Court" "1 Albert Street" ...
##  $ SUBURB          : chr  "South Lake" "Wandi" "Camillo" "Bellevue" ...
##  $ PRICE           : int  565000 365000 287000 255000 325000 409000 400000 370000 565000 685000 ...
##  $ BEDROOMS        : int  4 3 3 2 4 4 3 4 4 3 ...
##  $ BATHROOMS       : int  2 2 1 1 1 2 2 2 2 2 ...
##  $ GARAGE          : chr  "2" "2" "1" "2" ...
##  $ LAND_AREA       : int  600 351 719 651 466 759 386 468 875 552 ...
##  $ FLOOR_AREA      : int  160 139 86 59 131 118 132 158 168 126 ...
##  $ BUILD_YEAR      : chr  "2003" "2013" "1979" "1953" ...
##  $ CBD_DIST        : int  18300 26900 22600 17900 11200 27300 28200 41700 12100 5900 ...
##  $ NEAREST_STN     : chr  "Cockburn Central Station" "Kwinana Station" "Challis Station" "Midland Station" ...
##  $ NEAREST_STN_DIST: int  1800 4900 1900 3600 2000 1000 3700 1100 2500 508 ...
##  $ DATE_SOLD       : chr  "09-2018" "02-2019" "06-2015" "07-2018" ...
##  $ POSTCODE        : int  6164 6167 6111 6056 6054 6112 6112 6169 6022 6053 ...
##  $ LATITUDE        : num  -32.1 -32.2 -32.1 -31.9 -31.9 ...
##  $ LONGITUDE       : num  116 116 116 116 116 ...
##  $ NEAREST_SCH     : chr  "LAKELAND SENIOR HIGH SCHOOL" "ATWELL COLLEGE" "KELMSCOTT SENIOR HIGH SCHOOL" "SWAN VIEW SENIOR HIGH SCHOOL" ...
##  $ NEAREST_SCH_DIST: num  0.828 5.524 1.649 1.571 1.515 ...
##  $ NEAREST_SCH_RANK: int  NA 129 113 NA NA NA NA NA NA 29 ...

The dataset contains a total of 33,656 entries.
The features include: ADDRESS, SUBURB, PRICE, BEDROOMS, BATHROOMS, GARAGE, LAND_AREA, FLOOR_AREA, BUILD_YEAR, CBD_DIST, NEAREST_STN, NEAREST_STN_DIST, DATE_SOLD, POSTCODE, LATITUDE, LONGITUDE, NEAREST_SCH, NEAREST_SCH_DIST, NEAREST_SCH_RANK.
ADDRESS: The specific street address of the property.
- The address corresponds to the LATITUDE and LONGITUDE fields, which provide geospatial information for analysis.

SUBURB: The neighborhood where the property is located.

PRICE: The actual selling price of the property.
- PRICE is the target variable; it is excluded from the model’s input features and kept in the test set only for evaluation.

BEDROOMS: The number of bedrooms in the property.
- The number of bedrooms directly affects the functionality and value of the property, making it a key predictor of price.

BATHROOMS: The number of bathrooms in the property.
- The number of bathrooms relates to the comfort and usability of the house, influencing its price.

GARAGE: The number of garages associated with the property.
- The number of garages reflects parking convenience and adds value to the property, impacting its price.

FLOOR_AREA: The indoor usable area of the property (in square meters).
- Floor area is a critical determinant of property value and is usually positively correlated with price.

LAND_AREA: The total land area of the property’s lot (in square meters).
- Lot size affects outdoor space and redevelopment potential, and is therefore related to price.

BUILD_YEAR: The year the property was constructed, ranging from 1868 to 2017.
- Combined with the sale date, this gives the property’s age; the age of a property directly affects its condition and value.

CBD_DIST: The distance from the property to the Perth central business district (CBD), in meters.
- Properties closer to the CBD are generally more valuable due to better accessibility.

NEAREST_STN_DIST: The distance from the property to the nearest train station (in meters).
- Proximity to train stations often increases a property’s appeal, as convenient commuting adds value.

DATE_SOLD: The date the property was sold, ranging from 1990 to 2020.
- Sale date captures the effect of time trends on property prices, which tend to fluctuate over time.

NEAREST_SCH_DIST: The distance from the property to the nearest school (in kilometers).
- Access to educational resources is a key factor affecting property value, with properties closer to schools often being more desirable.

NEAREST_SCH_RANK: The ranking of nearby schools.

summary(all_perth_df)  
##    ADDRESS             SUBURB              PRICE            BEDROOMS     
##  Length:33656       Length:33656       Min.   :  51000   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 410000   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Median : 535500   Median : 4.000  
##                                        Mean   : 637072   Mean   : 3.659  
##                                        3rd Qu.: 760000   3rd Qu.: 4.000  
##                                        Max.   :2440000   Max.   :10.000  
##                                                                          
##    BATHROOMS         GARAGE            LAND_AREA        FLOOR_AREA   
##  Min.   : 1.000   Length:33656       Min.   :    61   Min.   :  1.0  
##  1st Qu.: 1.000   Class :character   1st Qu.:   503   1st Qu.:130.0  
##  Median : 2.000   Mode  :character   Median :   682   Median :172.0  
##  Mean   : 1.823                      Mean   :  2741   Mean   :183.5  
##  3rd Qu.: 2.000                      3rd Qu.:   838   3rd Qu.:222.2  
##  Max.   :16.000                      Max.   :999999   Max.   :870.0  
##                                                                      
##   BUILD_YEAR           CBD_DIST     NEAREST_STN        NEAREST_STN_DIST
##  Length:33656       Min.   :  681   Length:33656       Min.   :   46   
##  Class :character   1st Qu.:11200   Class :character   1st Qu.: 1800   
##  Mode  :character   Median :17500   Mode  :character   Median : 3200   
##                     Mean   :19777                      Mean   : 4523   
##                     3rd Qu.:26600                      3rd Qu.: 5300   
##                     Max.   :59800                      Max.   :35500   
##                                                                        
##   DATE_SOLD            POSTCODE       LATITUDE        LONGITUDE    
##  Length:33656       Min.   :6003   Min.   :-32.47   Min.   :115.6  
##  Class :character   1st Qu.:6050   1st Qu.:-32.07   1st Qu.:115.8  
##  Mode  :character   Median :6069   Median :-31.93   Median :115.9  
##                     Mean   :6089   Mean   :-31.96   Mean   :115.9  
##                     3rd Qu.:6150   3rd Qu.:-31.84   3rd Qu.:116.0  
##                     Max.   :6558   Max.   :-31.46   Max.   :116.3  
##                                                                    
##  NEAREST_SCH        NEAREST_SCH_DIST   NEAREST_SCH_RANK
##  Length:33656       Min.   : 0.07091   Min.   :  1.00  
##  Class :character   1st Qu.: 0.88057   1st Qu.: 39.00  
##  Mode  :character   Median : 1.34552   Median : 68.00  
##                     Mean   : 1.81527   Mean   : 72.67  
##                     3rd Qu.: 2.09722   3rd Qu.:105.00  
##                     Max.   :23.25437   Max.   :139.00  
##                                        NA's   :10952

2.3 Data Cleaning
2.3.1 Duplicate values

Check whether the dataset contains duplicate rows.
any() returns TRUE if any duplicates are found.

duplicates <- duplicated(all_perth_df)
any(duplicates) 
## [1] FALSE

2.3.2 Missing values
Data transformation: Convert BUILD_YEAR and DATE_SOLD to date format.

# Convert BUILD_YEAR (year only) to a Date, using 1 January of the build year
all_perth_df$BUILD_YEAR <- as.Date(paste(all_perth_df$BUILD_YEAR, "01-01", sep = "-"), format = "%Y-%m-%d")
# Convert GARAGE to numeric; non-numeric entries (e.g. "NULL") become NA
all_perth_df$GARAGE <- as.numeric(gsub("[^0-9]", NA, all_perth_df$GARAGE))

# Convert DATE_SOLD (MM-YYYY) to a Date, using the first day of the month
all_perth_df$DATE_SOLD <- as.Date(paste("01", all_perth_df$DATE_SOLD, sep = "-"), format = "%d-%m-%Y")

Check for missing values in each feature and calculate the number of missing values for each feature.

null_counts <- colSums(is.na(all_perth_df))
null_counts
##          ADDRESS           SUBURB            PRICE         BEDROOMS 
##                0                0                0                0 
##        BATHROOMS           GARAGE        LAND_AREA       FLOOR_AREA 
##                0             2478                0                0 
##       BUILD_YEAR         CBD_DIST      NEAREST_STN NEAREST_STN_DIST 
##             3155                0                0                0 
##        DATE_SOLD         POSTCODE         LATITUDE        LONGITUDE 
##                0                0                0                0 
##      NEAREST_SCH NEAREST_SCH_DIST NEAREST_SCH_RANK 
##                0                0            10952
  1. Delete the ‘NEAREST_SCH_RANK’ column.
  2. Remove rows with missing values in the GARAGE and BUILD_YEAR features.
all_perth_df <- all_perth_df[, !(colnames(all_perth_df) %in% 'NEAREST_SCH_RANK')]

all_perth_df <- all_perth_df[!is.na(all_perth_df$GARAGE) & !is.na(all_perth_df$BUILD_YEAR), ]

str(all_perth_df)
## 'data.frame':    28256 obs. of  18 variables:
##  $ ADDRESS         : chr  "1 Acorn Place" "1 Addis Way" "1 Ainsley Court" "1 Albert Street" ...
##  $ SUBURB          : chr  "South Lake" "Wandi" "Camillo" "Bellevue" ...
##  $ PRICE           : int  565000 365000 287000 255000 325000 409000 400000 370000 565000 685000 ...
##  $ BEDROOMS        : int  4 3 3 2 4 4 3 4 4 3 ...
##  $ BATHROOMS       : int  2 2 1 1 1 2 2 2 2 2 ...
##  $ GARAGE          : num  2 2 1 2 2 1 2 2 3 8 ...
##  $ LAND_AREA       : int  600 351 719 651 466 759 386 468 875 552 ...
##  $ FLOOR_AREA      : int  160 139 86 59 131 118 132 158 168 126 ...
##  $ BUILD_YEAR      : Date, format: "2003-01-01" "2013-01-01" ...
##  $ CBD_DIST        : int  18300 26900 22600 17900 11200 27300 28200 41700 12100 5900 ...
##  $ NEAREST_STN     : chr  "Cockburn Central Station" "Kwinana Station" "Challis Station" "Midland Station" ...
##  $ NEAREST_STN_DIST: int  1800 4900 1900 3600 2000 1000 3700 1100 2500 508 ...
##  $ DATE_SOLD       : Date, format: "2018-09-01" "2019-02-01" ...
##  $ POSTCODE        : int  6164 6167 6111 6056 6054 6112 6112 6169 6022 6053 ...
##  $ LATITUDE        : num  -32.1 -32.2 -32.1 -31.9 -31.9 ...
##  $ LONGITUDE       : num  116 116 116 116 116 ...
##  $ NEAREST_SCH     : chr  "LAKELAND SENIOR HIGH SCHOOL" "ATWELL COLLEGE" "KELMSCOTT SENIOR HIGH SCHOOL" "SWAN VIEW SENIOR HIGH SCHOOL" ...
##  $ NEAREST_SCH_DIST: num  0.828 5.524 1.649 1.571 1.515 ...

The results show that GARAGE has 2,478 missing values, BUILD_YEAR has 3,155 missing values, and NEAREST_SCH_RANK has 10,952 missing values.
Because the school-ranking column has so many missing values, it was deleted.
Rows with missing values in the GARAGE and BUILD_YEAR columns were removed; after deletion, the dataset contains 28,256 entries.

2.3.3 Outliers

Check for outliers in the following features: number of bedrooms, number of garages, land area, floor area, and the three distance features (distance to the CBD, the nearest station, and the nearest school).
Boxplots are drawn for each feature, and the IQR (interquartile range) is used to define the outlier range.

features_to_check <- c('BEDROOMS', 'LAND_AREA', 'FLOOR_AREA', 'CBD_DIST', 'NEAREST_STN_DIST', 'NEAREST_SCH_DIST','GARAGE')

par(mfrow = c(3, 3), mar = c(5, 5, 2, 1))  

for (feature in features_to_check) {
  boxplot(all_perth_df[[feature]], main = paste('Boxplot of', feature), col = 'lightblue', horizontal = TRUE, 
          xlab = feature, ylab = 'Value')
}

par(mfrow = c(1, 1))

All outliers identified in this way are then removed (a sketch of the removal step is shown below).
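The exact removal code is not reproduced in the report; a minimal sketch using the standard 1.5 × IQR rule on the features checked above (the multiplier and the per-feature loop are assumptions, not the report’s original code):

for (feature in features_to_check) {
  q <- quantile(all_perth_df[[feature]], probs = c(0.25, 0.75), na.rm = TRUE)
  lower <- q[1] - 1.5 * (q[2] - q[1])   # Q1 - 1.5 * IQR
  upper <- q[2] + 1.5 * (q[2] - q[1])   # Q3 + 1.5 * IQR
  all_perth_df <- all_perth_df[all_perth_df[[feature]] >= lower &
                                 all_perth_df[[feature]] <= upper, ]
}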

Add new feature columns.
HOUSE_AGE (new feature): the age of the property at the time of sale, calculated as HOUSE_AGE = SOLD_YEAR - BUILD_YEAR.
- HOUSE_AGE reflects the property’s condition, which typically influences its price.

all_perth_df$SOLD_YEAR <- as.numeric(format(all_perth_df$DATE_SOLD, "%Y"))  
all_perth_df$BUILD_YEAR <- as.numeric(format(all_perth_df$BUILD_YEAR, "%Y")) 
# HOUSE_AGE
all_perth_df$HOUSE_AGE <- all_perth_df$SOLD_YEAR - all_perth_df$BUILD_YEAR

head(all_perth_df$HOUSE_AGE)
## [1] 15 65 18  2  6 15


3 EDA

data <- all_perth_df  # the EDA below works on a copy of the cleaned dataset
features_to_plot <- c('PRICE', 'BEDROOMS', 'BATHROOMS', 'LAND_AREA', 'FLOOR_AREA', 'HOUSE_AGE')

missing_features <- setdiff(features_to_plot, colnames(data))
if (length(missing_features) > 0) {
  stop("Missing features: ", paste(missing_features, collapse = ", "))
}

plots <- list()  

for (feature in features_to_plot) {
  if (!is.numeric(data[[feature]])) {
    warning(paste("Skipping non-numeric feature:", feature))
    next
  }

  p <- ggplot(data, aes_string(x = feature)) +
    geom_histogram(fill = "lightblue", alpha = 0.8, bins = 30) +  
    theme_classic() +  
    labs(
      title = paste("Distribution of", feature),
      x = feature,
      y = "Frequency"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), 
      axis.title = element_text(size = 12),
      axis.text = element_text(size = 8)   
    )

  plots[[feature]] <- p  
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
do.call(grid.arrange, c(plots, ncol = 2))  # grid.arrange() is from the gridExtra package


Distribution Characteristics:
- The distribution of house age (HOUSE_AGE) is right-skewed, with most houses being 0-40 years old.
- Houses aged 10-20 years are the most common, indicating a relatively high proportion of newly built homes.
- A small number of houses are over 80 years old, with some even reaching 100 years, possibly representing historical buildings or older homes that have not been renovated or rebuilt.

Data Implications:
- The concentration of relatively low house ages suggests significant real estate development in the area in recent years.
- The small number of older houses may belong to specific areas or have historical value.

Data Applications:
- House age could be a key feature for predicting house prices, as newer houses are likely to have higher values.
- Further analysis of the relationship between house age and price is recommended to confirm its impact on property values; a quick check is sketched below.
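As a quick first look at that recommended follow-up, the correlation and a simple scatter plot can be produced as follows (a sketch, not part of the original analysis):

# Correlation between house age and price on the cleaned data
cor(data$HOUSE_AGE, data$PRICE, use = "complete.obs")

# Scatter plot with a linear trend line
ggplot(data, aes(x = HOUSE_AGE, y = PRICE)) +
  geom_point(alpha = 0.3, color = "lightblue") +
  geom_smooth(method = "lm", color = "steelblue") +
  theme_classic() +
  labs(title = "HOUSE_AGE vs PRICE", x = "HOUSE_AGE", y = "PRICE")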

suburb_count <- data %>%
  group_by(SUBURB) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 20)

p <- ggplot(suburb_count, aes(x = reorder(SUBURB, -count), y = count)) +
  geom_bar(stat = "identity", fill ="lightblue", alpha = 0.8) +  
  theme_classic() + 
  labs(
    title = "Top 20 Suburbs by Frequency",  
    x = "SUBURB",                          
    y = "Count"                            
  ) +
  theme(
    axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1, size = 10), 
    axis.title = element_text(size = 12),                                    
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold")       
  ) 


print(p) 


Data Description:

- The chart displays the top 20 regions with the highest number of properties, ranked in descending order.
- Tapping and Iluka have the most property records, each close to 15 properties.

Distribution Characteristics:
- The top two regions have slightly higher property counts, while the rest are relatively evenly distributed, ranging from 10 to 15 properties.

Insights:
- High-frequency regions may be key areas for real estate development.
- Further analysis of house prices and property characteristics in these regions is recommended.

# Relationship between floor area and house price
p <- ggplot(data, aes(x = FLOOR_AREA, y = PRICE)) +
  geom_point(alpha = 0.5, color = "lightblue") +
  theme_classic() +
  labs(title = "Relationship between FLOOR_AREA and PRICE", x = "FLOOR_AREA", y = "PRICE") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))

print(p)

# Relationship between number of bedrooms and house price
p <- ggplot(data, aes(x = factor(BEDROOMS), y = PRICE)) +
  geom_boxplot(fill = "lightblue", outlier.color = "black") +
  theme_classic() +
  labs(title = "Relationship between BEDROOMS and PRICE", x = "BEDROOMS", y = "PRICE") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))

print(p)


Distribution Characteristics:
- House prices (PRICE) are positively correlated with floor area (FLOOR_AREA); larger areas are associated with higher prices.
- Most houses have an area between 50-200 square meters, with prices concentrated in the range of 500,000 to 1,000,000.

High-Priced House Analysis:
- A small number of houses are priced above 1,500,000, potentially influenced by location or other features.

Summary:
- There is a clear positive correlation between house prices and floor area, which can be utilized for predictive analysis.
- Further analysis is recommended for high-priced houses and those with areas exceeding 200 square meters to explore their characteristics.

# Distribution of house prices
leaflet(data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~LONGITUDE, lat = ~LATITUDE,
    radius = 5,
    color = ~colorNumeric("YlOrRd", PRICE)(PRICE),
    stroke = FALSE, fillOpacity = 0.7,
    popup = ~paste("Price: $", round(PRICE, 2))
  ) %>%
  saveWidget(file = file.path(output_dir, "leaflet_map.html"), selfcontained = TRUE)  # saveWidget() is from htmlwidgets; output_dir must point to an existing output directory


avg_price_by_suburb <- data %>%
  group_by(SUBURB) %>%
  summarise(mean_price = mean(PRICE, na.rm = TRUE)) %>%
  arrange(desc(mean_price)) %>%
# Top 20 regions by house price
  slice_head(n = 20)  

p <- ggplot(avg_price_by_suburb, aes(x = reorder(SUBURB, -mean_price), y = mean_price)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  theme_classic() +
  labs(title = "Top 20 Suburbs by Average Price", x = "SUBURB", y = "Average Price") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    plot.title = element_text(hjust = 0.5, size = 14)
  )

print(p)


Trends:
- The greater the number of bedrooms (BEDROOMS), the higher the median house price (PRICE).

Distribution:
- Houses with 3 and 4 bedrooms are the most common.
- A significant number of high-priced houses are found among those with 4 bedrooms.

Summary:
- The number of bedrooms is an important factor influencing house prices.
- Further analysis is needed to explore the characteristics of high-priced houses.

data <- data %>%
  mutate(SOLD_YEAR = as.numeric(format(as.Date(DATE_SOLD, "%Y-%m-%d"), "%Y")))

p <- data %>%
  group_by(SOLD_YEAR) %>%
  summarise(mean_price = mean(PRICE, na.rm = TRUE)) %>%
  ggplot(aes(x = SOLD_YEAR, y = mean_price)) +
  geom_line(color = "steelblue") +
  theme_classic() +
  labs(title = "Average House Price Over Years", x = "Year", y = "Average Price") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))

print(p)


Trends:
House prices rose rapidly from 2005 to 2010, peaking in 2010. After 2010, house prices significantly declined and gradually stabilized.

Fluctuations:
After 2015, house prices experienced slight fluctuations but remained broadly stable at a relatively high level.

Summary:
House prices are likely influenced by factors such as policy and the wider economic environment. Further analysis of the drivers behind the 2010 peak and the post-2015 stable period is recommended.

data <- data %>%  
  mutate(PriceCategory = case_when(  
    PRICE < quantile(PRICE, 0.25) ~ "Low",  
    PRICE >= quantile(PRICE, 0.25) & PRICE < quantile(PRICE, 0.75) ~ "Medium",  
    TRUE ~ "High"  
  ))  

ggplot(data, aes(x = PriceCategory, y = PRICE)) +  
  geom_boxplot(fill = "lightblue", alpha = 0.8) +  
  scale_y_continuous(trans = 'log10') + 
  theme_classic() +  
  labs(title = "Price Distribution by Category", x = "Price Category", y = "Price (log scale)")

4 Feature Engineering
Use Leaflet to Create an Interactive Map (The Relationship Between Latitude/Longitude Features and House Prices)

leaflet(all_perth_df) %>%
  addTiles() %>%  
  addCircleMarkers(lng = ~LONGITUDE, lat = ~LATITUDE, 
                   radius = 5,  
                   color = ~colorNumeric("YlOrRd", PRICE)(PRICE), 
                   stroke = FALSE, fillOpacity = 0.7,  
                   popup = ~paste("Price: $", round(PRICE, 2))) 

It can be observed from the map that high housing prices are mainly concentrated in the western coastal suburbs and in areas along the Swan River.

Heatmap:

features_to_check <- c('PRICE', 'BEDROOMS', 'BATHROOMS', 'LAND_AREA', 'FLOOR_AREA', 'CBD_DIST', 'NEAREST_STN_DIST', 'NEAREST_SCH_DIST', 'HOUSE_AGE','POSTCODE')

selected_data <- all_perth_df[, features_to_check]

# correlation_matrix
correlation_matrix <- cor(selected_data, use = "complete.obs")  
cor_melted <- as.data.frame(as.table(correlation_matrix))
colnames(cor_melted) <- c("Var1", "Var2", "value")

#heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name = "Correlation") +
  #geom_text(aes(label = round(value, 3)), color = "black", size = 4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap of Features",
       x = "Features",
       y = "Features")

PRICE has a strong correlation with FLOOR_AREA, BEDROOMS, and BATHROOMS. The larger the floor area, the higher the house price tends to be.
There is also a strong positive correlation between BEDROOMS and BATHROOMS; typically, houses with more bedrooms also have more bathrooms.
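For reference, the correlations with PRICE can also be ranked directly from the matrix computed above:

# Rank features by their correlation with PRICE
sort(correlation_matrix[, "PRICE"], decreasing = TRUE)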

5 Modelling
5.1 House Price Classification
In this task, we aim to develop machine learning models to classify Perth’s second-hand houses into three price levels: low, medium, and high. This classification allows us to better understand the characteristics of houses in each category and provides insights for buyers and real estate agents. We use Random Forest and (multinomial) Logistic Regression models for this task.

# Load necessary library
library(caret)  # createDataPartition(), confusionMatrix()

# Set seed for reproducibility
set.seed(123)

# Create train-test split
trainIndex <- createDataPartition(all_perth_df$PRICE, p = 0.8, list = FALSE)
trainData <- all_perth_df[trainIndex, ]
testData <- all_perth_df[-trainIndex, ]
# Create consistent breaks from trainData
price_breaks <- quantile(trainData$PRICE, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE)

# Add Price_Category to trainData
trainData$Price_Category <- cut(
  trainData$PRICE,
  breaks = price_breaks,
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)

# Add Price_Category to testData
testData$Price_Category <- cut(
  testData$PRICE,
  breaks = price_breaks,
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)
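One detail worth checking: because the category breaks come from the training quantiles, any test-set price outside the training price range becomes NA under cut(). A quick sanity check (a sketch; such rows, if any, should be inspected or dropped before evaluation):

# Count test rows per category, including any NA from out-of-range prices
table(testData$Price_Category, useNA = "ifany")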
# Load necessary library
library(randomForest)

# Train Random Forest model
set.seed(123)
rf_model <- randomForest(
  Price_Category ~ BEDROOMS + BATHROOMS + GARAGE + LAND_AREA + FLOOR_AREA + BUILD_YEAR + CBD_DIST + NEAREST_STN_DIST + LATITUDE + LONGITUDE + NEAREST_SCH_DIST + SOLD_YEAR + HOUSE_AGE,
  data = trainData,
  importance = TRUE,
  ntree = 100
)

# Predict on test data
testData$Predicted_Category_RF <- predict(rf_model, newdata = testData)

# Evaluate the Prediction Model
conf_matrix_rf <- confusionMatrix(testData$Predicted_Category_RF, testData$Price_Category)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low    775    123    5
##     Medium 150    727  100
##     High     7     98  836
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8288          
##                  95% CI : (0.8144, 0.8425)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7431          
##                                           
##  Mcnemar's Test P-Value : 0.388           
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity              0.8315        0.7669      0.8884
## Specificity              0.9322        0.8665      0.9441
## Pos Pred Value           0.8583        0.7441      0.8884
## Neg Pred Value           0.9181        0.8802      0.9441
## Prevalence               0.3304        0.3361      0.3336
## Detection Rate           0.2747        0.2577      0.2963
## Detection Prevalence     0.3201        0.3463      0.3336
## Balanced Accuracy        0.8819        0.8167      0.9163

The model achieved an overall accuracy of 82.88% with a 95% confidence interval of 81.44% to 84.25%, demonstrating strong predictive performance compared to a no-information rate of 33.61%. The kappa statistic of 0.7431 indicates substantial agreement between predictions and actual values. Class-wise performance metrics show that the model has high sensitivity for the “High” price category (88.84%) and reasonable sensitivity for “Low” (83.15%) and “Medium” (76.69%) categories. Specificity was consistently high across all categories, ranging from 86.65% for “Medium” to 94.41% for “High.” The balanced accuracy, which considers both sensitivity and specificity, was also strong, particularly for “High” (91.63%) and “Low” (88.19%). Overall, the model effectively distinguishes between the three price categories.

# Feature Importance Bar Chart
importance_rf <- as.data.frame(importance(rf_model))
importance_rf$Feature <- rownames(importance_rf)
importance_rf <- importance_rf[order(-importance_rf$MeanDecreaseGini), ]

ggplot(importance_rf, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Feature Importance: Random Forest",
    x = "Feature",
    y = "Mean Decrease in Gini Index"
  )

The feature importance bar chart identifies the key predictors in the Random Forest model based on their contribution to reducing the Gini index. The features at the top of the chart, such as FLOOR_AREA, CBD_DIST, and LONGITUDE, are the most influential in determining the model’s classifications.

# Load necessary library
library(nnet)  # multinom() fits a multinomial logistic regression

# Train Logistic Regression model
logit_model <- multinom(
  Price_Category ~ BEDROOMS + BATHROOMS + GARAGE + LAND_AREA + FLOOR_AREA + CBD_DIST + NEAREST_STN_DIST + NEAREST_SCH_DIST + HOUSE_AGE,
  data = trainData,
  trace = FALSE
)

# Predict on test data
testData$Predicted_Category_Logit <- predict(logit_model, newdata = testData)

# Evaluate the Prediction Model
conf_matrix_logit <- confusionMatrix(testData$Predicted_Category_Logit, testData$Price_Category)
print(conf_matrix_logit)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low    709    195   30
##     Medium 190    571  202
##     High    33    182  709
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7051          
##                  95% CI : (0.6879, 0.7219)
##     No Information Rate : 0.3361          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5576          
##                                           
##  Mcnemar's Test P-Value : 0.7412          
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity              0.7607        0.6023      0.7535
## Specificity              0.8809        0.7907      0.8856
## Pos Pred Value           0.7591        0.5929      0.7673
## Neg Pred Value           0.8818        0.7971      0.8777
## Prevalence               0.3304        0.3361      0.3336
## Detection Rate           0.2513        0.2024      0.2513
## Detection Prevalence     0.3311        0.3414      0.3275
## Balanced Accuracy        0.8208        0.6965      0.8195

The model achieved an overall accuracy of 70.51% with a 95% confidence interval of 68.79% to 72.19%, significantly outperforming the no-information rate of 33.61%. The kappa statistic of 0.5576 indicates moderate agreement between predictions and actual values. Class-wise performance shows good sensitivity for “Low” (76.07%) and “High” (75.35%) categories but relatively lower sensitivity for the “Medium” category (60.23%). Specificity values are high across all categories, ranging from 79.07% for “Medium” to 88.56% for “High.” The balanced accuracy, which averages sensitivity and specificity, is strong for “Low” (82.08%) and “High” (81.95%) but lower for “Medium” (69.65%).

Random Forest performance: the High category is classified most accurately (836 correct, with 100 houses misclassified as Medium). The Low category also performs well (775 correct), although 150 houses are misclassified as Medium. The Medium category shows the most confusion, with 123 houses misclassified as Low and 98 as High.

Logistic regression performance: the High and Low categories are each predicted correctly for 709 houses, but with substantial leakage into Medium (202 and 190 misclassified instances respectively). The Medium category performs worst, with 195 houses misclassified as Low and 182 as High.

The Random Forest model performs well for the “High” category, with accurate predictions. However, there is overlap between “Low” and “Medium” categories, indicating potential misclassification. Similarly, the Logistic Regression model shows concentrated accuracy in the “High” category but struggles to distinguish between “Low” and “Medium,” with notable misclassification.
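The Low/Medium overlap is easier to quantify with column-normalised confusion matrices (a sketch using the confusionMatrix objects computed above):

# Share of each true class assigned to each predicted category
round(prop.table(conf_matrix_rf$table, margin = 2), 3)
round(prop.table(conf_matrix_logit$table, margin = 2), 3)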

# Partial Dependence Plot for FLOOR_AREA (partial() and autoplot() are from the pdp package)
pdp_floor <- partial(rf_model, pred.var = "FLOOR_AREA", train = trainData)
plot_pdp_floor <- autoplot(pdp_floor, contour = FALSE) +
  ggtitle("Partial Dependence: FLOOR_AREA") +
  theme_minimal()

# Partial Dependence Plot for LAND_AREA
pdp_land <- partial(rf_model, pred.var = "LAND_AREA", train = trainData)
plot_pdp_land <- autoplot(pdp_land, contour = FALSE) +
  ggtitle("Partial Dependence: LAND_AREA") +
  theme_minimal()

# Combine the plots side by side (the `+` composition requires the patchwork package)
plot_pdp_floor + plot_pdp_land

The partial dependence plot for FLOOR_AREA shows that increasing floor area initially boosts predicted prices, but beyond a certain point the effect diminishes and turns negative. The plot for LAND_AREA indicates that larger land areas initially raise predicted values, but after a peak their impact weakens and declines.

5.2 House Price Prediction
In this task, we aim to predict the actual prices of Perth’s second-hand houses. Price prediction provides essential insights for valuing properties accurately and is critical for market analysis. We use LightGBM and XGBoost, two efficient gradient-boosting models, to predict house prices based on features such as FLOOR_AREA, LAND_AREA, CBD_DIST, and others.

# Load necessary library

# Set seed for reproducibility
set.seed(123)

# Create train-test split
trainIndex1 <- createDataPartition(all_perth_df$PRICE, p = 0.8, list = FALSE)
trainData1 <- all_perth_df[trainIndex1, ]
testData1 <- all_perth_df[-trainIndex1, ]
# Train LightGBM model

# Load necessary library
library(lightgbm)

# Train LightGBM model (note: lgb.Dataset() expects a numeric matrix, so character columns such as ADDRESS and SUBURB should be dropped or numerically encoded first, as is done for the XGBoost model below)
lgb_model <- lgb.train(
  params = list(objective = "regression", metric = "rmse", seed = 42),
  data = lgb.Dataset(data = as.matrix(trainData1 %>% select(-PRICE)), label = trainData1$PRICE),
  nrounds = 100,
  valids = list(train = lgb.Dataset(data = as.matrix(trainData1 %>% select(-PRICE)), label = trainData1$PRICE)),
  early_stopping_rounds = 10,
  verbose = -1 
)



# Predict on test data
testData1$Predicted_Price_LGB <- predict(lgb_model, as.matrix(testData1 %>% select(-PRICE)))

# Evaluate the Prediction Model
evaluate_model <- function(actual, predicted) { 
  rmse_value <- sqrt(mean((actual - predicted)^2))
  mae_value <- mean(abs(actual - predicted))
  r2_value <- 1 - (sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
  list(RMSE = rmse_value, MAE = mae_value, R2 = r2_value)
}

results_lgb <- evaluate_model(testData1$PRICE, testData1$Predicted_Price_LGB)
print(results_lgb)
## $RMSE
## [1] 124339.9
## 
## $MAE
## [1] 77274.79
## 
## $R2
## [1] 0.8695867

The LightGBM regression model achieved solid performance, with an RMSE of 124,339.9, MAE of 77,274.79, and an R² of 0.8696, indicating a strong ability to explain the variance in housing prices. The model showed consistent improvement during training, reducing RMSE from 322,728 to 93,306.2 over 100 iterations. These results highlight the model’s effectiveness in capturing key patterns in the data, though additional tuning or feature refinement may further enhance accuracy.
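As one possible direction for the tuning mentioned above, a small learning-rate sweep can be evaluated on the held-out set. This is only a sketch: the numeric-column filter, the candidate rates, and nrounds = 300 are assumptions, not part of the original workflow.

# Restrict to numeric columns so as.matrix() yields a numeric matrix
num_feats <- setdiff(names(trainData1)[sapply(trainData1, is.numeric)], "PRICE")

for (lr in c(0.05, 0.1, 0.3)) {
  m <- lgb.train(
    params = list(objective = "regression", metric = "rmse",
                  learning_rate = lr, seed = 42),
    data = lgb.Dataset(as.matrix(trainData1[, num_feats]), label = trainData1$PRICE),
    nrounds = 300,
    verbose = -1
  )
  pred <- predict(m, as.matrix(testData1[, num_feats]))
  cat("learning_rate =", lr, "test RMSE =",
      round(evaluate_model(testData1$PRICE, pred)$RMSE), "\n")
}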

# Train XGBoost model
# Load necessary library
library(xgboost)
# Define convert_to_numeric function
convert_to_numeric <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col) || is.character(col)) {
      as.numeric(as.factor(col))
    } else {
      col
    }
  })
  return(df)
}

# Convert train and test datasets to numeric
train_numeric <- convert_to_numeric(trainData1)
test_numeric <- convert_to_numeric(testData1)

# Select features and convert to matrix
features <- train_numeric %>%
  select(-PRICE) %>%
  colnames()

train_matrix <- as.matrix(train_numeric[, features])
test_matrix <- as.matrix(test_numeric[, features])

# Train XGBoost model with early stopping
xgb_model <- xgb.train(
  params = list(objective = "reg:squarederror", max_depth = 6, eta = 0.3, subsample = 0.8),
  data = xgb.DMatrix(data = train_matrix, label = train_numeric$PRICE),
  nrounds = 100,
  watchlist = list(train = xgb.DMatrix(data = train_matrix, label = train_numeric$PRICE)),
  early_stopping_rounds = 10,
  verbose = 0
)


# Predict on test data
test_numeric$Predicted_Price_XGB <- predict(xgb_model, xgb.DMatrix(data = test_matrix))

# Evaluate the Prediction Model
evaluate_model <- function(actual, predicted) {
  rmse_value <- sqrt(mean((actual - predicted)^2))
  mae_value <- mean(abs(actual - predicted))
  r2_value <- 1 - (sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
  list(RMSE = rmse_value, MAE = mae_value, R2 = r2_value)
}

results_xgb <- evaluate_model(test_numeric$PRICE, test_numeric$Predicted_Price_XGB)
print(results_xgb)
## $RMSE
## [1] 137848.4
## 
## $MAE
## [1] 90908.12
## 
## $R2
## [1] 0.8397108

The XGBoost model demonstrates its ability to predict house prices effectively, as indicated by its evaluation metrics. During training, the model’s RMSE steadily decreased from an initial value of 538,480 to 56,489 over 100 iterations, showcasing its learning and optimization process. When applied to test data, the model achieved an RMSE of 137,848.4, an MAE of 90,908.12, and an R² value of 0.8397. These results highlight the model’s capability to minimize error and explain a significant proportion of the variance in house prices, confirming its suitability for the regression task.

The RMSE (root mean square error) for both models decreased steadily with each iteration, indicating consistent improvement in prediction accuracy during training. The LightGBM learning curve shows a steady RMSE decrease, with rapid error reduction initially and gradual convergence, indicating consistent accuracy improvement. Similarly, the XGBoost curve highlights sharp early RMSE declines that stabilize over iterations, demonstrating effective learning and error minimization without overfitting.

The scatter plot shows predicted prices from LightGBM and XGBoost closely clustering around the diagonal, indicating strong and comparable accuracy, with slight deviations at higher price ranges. The histogram of residuals reveals both models’ errors are centered near zero, with LightGBM showing slightly tighter accuracy. The density plot highlights similar prediction trends between the models, with minor variations in spread, demonstrating comparable performance in property price forecasting.
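The predicted-versus-actual comparison described above can be reproduced with a sketch like the following (the original plotting code is not shown; pred_df and the long format are assumptions):

# Collect actual and predicted prices for both models (rows align by test-set order)
pred_df <- data.frame(
  Actual   = testData1$PRICE,
  LightGBM = testData1$Predicted_Price_LGB,
  XGBoost  = test_numeric$Predicted_Price_XGB
)
pred_long <- pivot_longer(pred_df, cols = c("LightGBM", "XGBoost"),
                          names_to = "Model", values_to = "Predicted")

ggplot(pred_long, aes(x = Actual, y = Predicted, color = Model)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Predicted vs Actual Prices", x = "Actual Price", y = "Predicted Price")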

# Combine evaluation metrics into a data frame
model_comparison <- data.frame(
  Metric = c("RMSE", "MAE", "R2"),
  LightGBM = c(results_lgb$RMSE, results_lgb$MAE, results_lgb$R2),
  XGBoost = c(results_xgb$RMSE, results_xgb$MAE, results_xgb$R2)
)

# Convert to long format for ggplot2 (pivot_longer() is from the tidyr package)

model_comparison_long <- pivot_longer(model_comparison, cols = c("LightGBM", "XGBoost"), names_to = "Model", values_to = "Value")

# Plot the comparison
ggplot(model_comparison_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Model Performance Comparison: LightGBM vs XGBoost",
    y = "Value",
    x = "Metric"
  ) +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) +  # Keep axis text horizontal
  geom_text(aes(label = round(Value, 2)), position = position_dodge(width = 0.7), vjust = -0.5, size = 3.5) +  # Add values above bars
  ylim(0, max(model_comparison_long$Value) * 1.2)  # Add more padding to the y-axis

This bar chart compares the performance of LightGBM and XGBoost models using three evaluation metrics: RMSE, MAE, and R². LightGBM outperforms XGBoost in both RMSE (124,339.86 vs. 137,848.35) and MAE (77,274.79 vs. 90,908.12), indicating it provides better overall predictions. However, both models exhibit comparable R² scores (0.87 for LightGBM and 0.84 for XGBoost), highlighting similar capabilities in explaining variance. The chart visually emphasizes LightGBM’s marginally superior accuracy for this dataset.

The bar charts reveal feature importance for LightGBM and XGBoost in predicting property prices. Both models rank “Postcode,” “Floor Area,” and “CBD Distance” as the most critical factors, followed by “Longitude” and “Latitude,” emphasizing spatial influences. Features such as “Bedrooms” have minimal impact, providing insight into the key variables driving predictions.
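The importance charts referenced in this paragraph are not reproduced above; a minimal sketch using each package’s built-in importance extractors (top_n = 10 is an arbitrary choice):

# LightGBM gain-based importance
lgb_imp <- lgb.importance(lgb_model)
lgb.plot.importance(lgb_imp, top_n = 10, measure = "Gain")

# XGBoost gain-based importance
xgb_imp <- xgb.importance(model = xgb_model)
xgb.plot.importance(xgb_imp, top_n = 10)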

6 Conclusion
Key Determinants of Housing Prices:
Features such as FLOOR_AREA, LAND_AREA, CBD_DIST, and HOUSE_AGE play significant roles in determining house prices.
Models:
1. The Random Forest model outperforms Logistic Regression in the classification task.
2. Gradient boosting models (LightGBM and XGBoost) provide robust frameworks for predicting house prices.
Actionable Insights for Stakeholders:
1. Real estate agents can focus on key features such as floor area, proximity to the CBD, and location for accurate pricing.
2. Buyers can make informed decisions based on historical trends and location-based analysis.
