HDB Resale Flat Prices Prediction

Group 6
Members:

Introduction:

Forum discussions suggest mixed views on the impact of expressway proximity on HDB resale prices—some cite noise pollution as a drawback, while others highlight the benefits of unobstructed views for higher floors. To explore the potential influence of expressways and other factors, this project investigates two aspects of the HDB resale market: predicting the resale prices and classifying whether an HDB unit is located in a high-price area.

Overview of The Project Dataset

The dataset consists of information about the resale prices of Singapore public housing (HDB) flats or apartments from the years 2015 to 2024. The dataset consists of 220,971 rows, each representing 1 transaction.There are 11 columns:

Source:https://www.kaggle.com/datasets/adisongoh/resale-flat-prices-with-distance-from-expressway/data

Objectives of The Predictives Analysis

Objective 1: To predict the resale prices of HDB flats. (using regression methods)

Objective 2: To classify whether an HDB unit is located in a high-price area. (using classification methods)

Objective1 Analysis

1.Data Preprocessing

1.1Load Required Libraries

library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(fastDummies)
library(caret)
## Loading required package: lattice
library(readr)

1.2Load Data

data <- read_csv("~/Desktop/um/WQD7008/hdb_presale_prices_2015-2024_cleaned_regression.csv")
## Rows: 220971 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): town, flat_type, block, street_name, storey_range, storey_range_cat...
## dbl (4): year, floor_area_sqm, remaining_lease_years, resale_price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.3View data structure and preview

cat("\nInitial data structure:\n")
## 
## Initial data structure:
str(data)
## spc_tbl_ [220,971 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year                    : num [1:220971] 2017 2017 2017 2017 2017 ...
##  $ town                    : chr [1:220971] "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
##  $ flat_type               : chr [1:220971] "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
##  $ block                   : chr [1:220971] "406" "108" "602" "465" ...
##  $ street_name             : chr [1:220971] "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
##  $ storey_range            : chr [1:220971] "10 TO 12" "01 TO 03" "01 TO 03" "04 TO 06" ...
##  $ floor_area_sqm          : num [1:220971] 44 67 67 68 67 68 68 67 68 67 ...
##  $ remaining_lease_years   : num [1:220971] 61 60 62 62 62 63 61 58 61 61 ...
##  $ resale_price            : num [1:220971] 232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
##  $ storey_range_category   : chr [1:220971] "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
##  $ distance_from_expressway: chr [1:220971] ">500m" ">500m" ">500m" "151-300m" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   town = col_character(),
##   ..   flat_type = col_character(),
##   ..   block = col_character(),
##   ..   street_name = col_character(),
##   ..   storey_range = col_character(),
##   ..   floor_area_sqm = col_double(),
##   ..   remaining_lease_years = col_double(),
##   ..   resale_price = col_double(),
##   ..   storey_range_category = col_character(),
##   ..   distance_from_expressway = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
cat("\nInitial data preview:\n")
## 
## Initial data preview:
head(data)
## # A tibble: 6 × 11
##    year town       flat_type block street_name       storey_range floor_area_sqm
##   <dbl> <chr>      <chr>     <chr> <chr>             <chr>                 <dbl>
## 1  2017 ANG MO KIO 2 ROOM    406   ANG MO KIO AVENU… 10 TO 12                 44
## 2  2017 ANG MO KIO 3 ROOM    108   ANG MO KIO AVENU… 01 TO 03                 67
## 3  2017 ANG MO KIO 3 ROOM    602   ANG MO KIO AVENU… 01 TO 03                 67
## 4  2017 ANG MO KIO 3 ROOM    465   ANG MO KIO AVENU… 04 TO 06                 68
## 5  2017 ANG MO KIO 3 ROOM    601   ANG MO KIO AVENU… 01 TO 03                 67
## 6  2017 ANG MO KIO 3 ROOM    150   ANG MO KIO AVENU… 01 TO 03                 68
## # ℹ 4 more variables: remaining_lease_years <dbl>, resale_price <dbl>,
## #   storey_range_category <chr>, distance_from_expressway <chr>

Based on the description of the dataset, storey_range_category is a categorical variable created based on “storey_range” (Low (01-06); Low-Mid (07-12); Mid (13-18); High (19-30); Very High (>30)). Therefore, we observe that storey_range_category and storey_range are highly correlated. Here, we remove the “storey_range” column. And we remove the column of “block”.

# remove the "storey_range"and"block" column
data <- data %>%
  select(-storey_range)
data <- data %>%
  select(-block)
# check the remaining columns.
print(names(data))
## [1] "year"                     "town"                    
## [3] "flat_type"                "street_name"             
## [5] "floor_area_sqm"           "remaining_lease_years"   
## [7] "resale_price"             "storey_range_category"   
## [9] "distance_from_expressway"

1.4 Check and Handle Missing Values

missing_summary <- colSums(is.na(data))
cat("\nMissing values in each column:\n")
## 
## Missing values in each column:
print(missing_summary)
##                     year                     town                flat_type 
##                        0                        0                        0 
##              street_name           floor_area_sqm    remaining_lease_years 
##                        0                        0                        0 
##             resale_price    storey_range_category distance_from_expressway 
##                        0                        0                        0

Upon examination, we discovered that there are no missing values in the dataset.

1.5 Check and Handle Duplicates

# Original number of rows
cat("\nOriginal number of rows:", nrow(data), "\n")
## 
## Original number of rows: 220971
# Check for duplicate rows
duplicated_rows <- duplicated(data)
cat("\nNumber of duplicate rows before cleaning:", sum(duplicated_rows), "\n")
## 
## Number of duplicate rows before cleaning: 10305
# Remove duplicates
data <- data[!duplicated(data), ]

# Check the number of rows after cleaning
cat("\nNumber of rows after cleaning:", nrow(data), "\n")
## 
## Number of rows after cleaning: 210666

Then, we discovered that there are approximately 10,000 duplicate values. However, since the dataset contains over 200,000 entries, we decided to directly remove these duplicate values.

2.Exploratory Data Analysis (EDA)

# check the remaining columns.
print(names(data))
## [1] "year"                     "town"                    
## [3] "flat_type"                "street_name"             
## [5] "floor_area_sqm"           "remaining_lease_years"   
## [7] "resale_price"             "storey_range_category"   
## [9] "distance_from_expressway"
str(data)
## tibble [210,666 × 9] (S3: tbl_df/tbl/data.frame)
##  $ year                    : num [1:210666] 2017 2017 2017 2017 2017 ...
##  $ town                    : chr [1:210666] "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
##  $ flat_type               : chr [1:210666] "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
##  $ street_name             : chr [1:210666] "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
##  $ floor_area_sqm          : num [1:210666] 44 67 67 68 67 68 68 67 68 67 ...
##  $ remaining_lease_years   : num [1:210666] 61 60 62 62 62 63 61 58 61 61 ...
##  $ resale_price            : num [1:210666] 232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
##  $ storey_range_category   : chr [1:210666] "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
##  $ distance_from_expressway: chr [1:210666] ">500m" ">500m" ">500m" "151-300m" ...

Through the analysis of the dataset, we can observe that there are currently 4 numerical variables and 5 character variables. We are now focusing on the three variables ‘town’, ‘street_name’, and ‘flat_type’, as they contain numerous values and have not been categorized into intervals.

# 1. Descriptive Statistical Analysis.
categorical_summary <- function(data, cat_var) {
  data %>%
    group_by(!!sym(cat_var)) %>%
    summarise(
      count = n(),
      mean_price = mean(resale_price),
      median_price = median(resale_price),
      sd_price = sd(resale_price),
      cv = sd_price / mean_price * 100,
      percentage = n() / nrow(data) * 100
    ) %>%
    arrange(desc(mean_price))
}
#Analyze each categorical variable
town_stats <- categorical_summary(data, "town")
flat_type_stats <- categorical_summary(data, "flat_type")
street_stats <- categorical_summary(data, "street_name")
# Print the results.
cat("\nTown Statistics:\n")
## 
## Town Statistics:
print(town_stats)
## # A tibble: 26 × 7
##    town            count mean_price median_price sd_price    cv percentage
##    <chr>           <int>      <dbl>        <dbl>    <dbl> <dbl>      <dbl>
##  1 BUKIT TIMAH       548    743519.      727500   230780.  31.0      0.260
##  2 BISHAN           3954    671918.      653661.  204067.  30.4      1.88 
##  3 CENTRAL AREA     1860    664782.      555000   282268.  42.5      0.883
##  4 QUEENSTOWN       5810    604569.      620000   243409.  40.3      2.76 
##  5 BUKIT MERAH      8224    603341.      618000   228757.  37.9      3.90 
##  6 KALLANG/WHAMPOA  6507    554265.      528888   228108.  41.2      3.09 
##  7 MARINE PARADE    1326    549099.      484444   192353.  35.0      0.629
##  8 PASIR RIS        6472    545630.      525000   129049.  23.7      3.07 
##  9 SERANGOON        4071    529584.      495000   178833.  33.8      1.93 
## 10 CLEMENTI         4882    518637.      453000   208916.  40.3      2.32 
## # ℹ 16 more rows
cat("\nFlat Type Statistics:\n")
## 
## Flat Type Statistics:
print(flat_type_stats)
## # A tibble: 7 × 7
##   flat_type        count mean_price median_price sd_price    cv percentage
##   <chr>            <int>      <dbl>        <dbl>    <dbl> <dbl>      <dbl>
## 1 MULTI-GENERATION    81    836419.       830000  113481.  13.6     0.0384
## 2 EXECUTIVE        16259    689728.       668000  144760.  21.0     7.72  
## 3 5 ROOM           52589    586049.       555000  160131.  27.3    25.0   
## 4 4 ROOM           87603    490296.       460000  140712.  28.7    41.6   
## 5 3 ROOM           50533    348764.       330000   90494.  25.9    24.0   
## 6 2 ROOM            3530    276219.       267000   55423.  20.1     1.68  
## 7 1 ROOM              71    204650.       205000   26500.  12.9     0.0337
cat("\nStreet Statistics:\n")
## 
## Street Statistics:
print(street_stats)
## # A tibble: 566 × 7
##    street_name          count mean_price median_price sd_price    cv percentage
##    <chr>                <int>      <dbl>        <dbl>    <dbl> <dbl>      <dbl>
##  1 CANTONMENT ROAD        637   1008304.       975000  147393. 14.6     0.302  
##  2 LORONG 1A TOA PAYOH    458    921531.       902500  178297. 19.3     0.217  
##  3 HENDERSON ROAD         119    906997.       860000  242163. 26.7     0.0565 
##  4 BOON TIONG ROAD        353    897149.       910000  149705. 16.7     0.168  
##  5 JALAN MA'MOR            38    874126.       820000  179176. 20.5     0.0180 
##  6 ZION ROAD               20    870194.       850000   83127.  9.55    0.00949
##  7 ANG MO KIO STREET 21    92    868433.       862500  140521. 16.2     0.0437 
##  8 DAWSON ROAD            544    862212.       855000  141766. 16.4     0.258  
##  9 ANG MO KIO STREET 51    91    861692.       910000  126123. 14.6     0.0432 
## 10 REDHILL LANE            53    860543.       858000   96358. 11.2     0.0252 
## # ℹ 556 more rows
# Create box plots for each categorical variable.
p1 <- ggplot(data, aes(x = town, y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Resale Price by Town",
       x = "Town",
       y = "Resale Price")

p2 <- ggplot(data, aes(x = flat_type, y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Resale Price by Flat Type",
       x = "Flat Type",
       y = "Resale Price")


top_10_streets <- data %>%
  group_by(street_name) %>%
  summarise(median_price = median(resale_price)) %>%
  arrange(desc(median_price)) %>%
  head(10)

p3 <- data %>%
  filter(street_name %in% top_10_streets$street_name) %>%
  ggplot(aes(x = reorder(street_name, -resale_price), y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 10 Streets by Median Resale Price",
       x = "Street",
       y = "Resale Price")
p1

p2

p3

# 3. Statistical Test - ANOVA Analysis
anova_analysis <- function(data, cat_var) {
  formula <- as.formula(paste("resale_price ~", cat_var))
  model <- aov(formula, data = data)
  
  # Get results
  anova_result <- summary(model)
  
  # Calculate effect size (Eta-squared)
  ss_total <- sum((data$resale_price - mean(data$resale_price))^2)
  eta_squared <- anova_result[[1]]$"Sum Sq"[1] / ss_total
  
  # Organize results
  result <- list(
    variable = cat_var,
    f_value = anova_result[[1]]$"F value"[1],
    p_value = anova_result[[1]]$"Pr(>F)"[1],
    eta_squared = eta_squared
  )
  
  return(result)
}

# Perform ANOVA analysis for each categorical variable
town_anova <- anova_analysis(data, "town")
flat_type_anova <- anova_analysis(data, "flat_type")
street_anova <- anova_analysis(data, "street_name")

# Organize ANOVA results
anova_results <- data.frame(
  Variable = c("Town", "Flat Type", "Street"),
  F_Value = c(town_anova$f_value, flat_type_anova$f_value, street_anova$f_value),
  P_Value = c(town_anova$p_value, flat_type_anova$p_value, street_anova$p_value),
  Eta_Squared = c(town_anova$eta_squared, flat_type_anova$eta_squared, street_anova$eta_squared)
)

# Print ANOVA results
cat("\nANOVA Analysis Results:\n")
## 
## ANOVA Analysis Results:
print(anova_results)
##    Variable    F_Value P_Value Eta_Squared
## 1      Town  1068.0866       0   0.1125049
## 2 Flat Type 21139.1761       0   0.3758142
## 3    Street   253.9968       0   0.4058395

Since the effect size of Town is relatively small and the information provided by Town may be partially captured by Street, reducing the number of features helps improve the stability of the model. Therefore, we have decided to remove the Town feature.

# 1. Analyze and reclassify Street (effect size: 0.4058)
street_analysis <- data %>%
  group_by(street_name) %>%
  summarise(
    avg_price = mean(resale_price),
    count = n()
  ) %>%
  filter(count >= 10) %>%
  arrange(desc(avg_price)) %>%
  # Use quartiles for classification.
  mutate(
    price_quantile = ntile(avg_price, 4),
    street_category = case_when(
      price_quantile == 4 ~ "Premium",
      price_quantile == 3 ~ "High",
      price_quantile == 2 ~ "Medium",
      price_quantile == 1 ~ "Basic"
    )
  )

# 2. Analyze and reclassify Flat Type (effect size: 0.3758)
data_new <- data %>%
  # Merge Street classifications.
  left_join(
    street_analysis %>% select(street_name, street_category),
    by = "street_name"
  ) %>%
  # Label unmatched streets (sample size < 10) as "Other".
  mutate(street_category = ifelse(is.na(street_category), "Other", street_category)) %>%
  
  # Reclassify Flat Type (based on the number of rooms and type).
  mutate(
    flat_category = case_when(
      grepl("EXECUTIVE", flat_type) ~ "Executive",
      grepl("5 ROOM", flat_type) ~ "5-Room",
      grepl("4 ROOM", flat_type) ~ "4-Room",
      grepl("3 ROOM", flat_type) ~ "3-Room",
      grepl("2 ROOM|1 ROOM", flat_type) ~ "1/2-Room",
      TRUE ~ "Other"
    )
  )
# Verify the effectiveness of the new classification
# Create box plots
p1 <- ggplot(data_new, aes(x = street_category, y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Resale Price by Street Category",
       x = "Street Category",
       y = "Resale Price")

p2 <- ggplot(data_new, aes(x = flat_category, y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Resale Price by Flat Category",
       x = "Flat Category",
       y = "Resale Price")

# Display the plot.
gridExtra::grid.arrange(p1, p2, ncol = 2)

3.Feature Encoding

Because it is a sequential variable, we choose the Label encoding here.

# Perform encoding and prepare for correlation analysis
data_encoded <- data_new %>%
  # Encode street_category
  mutate(street_encoded = as.numeric(factor(street_category, 
    levels = c("Basic", "Medium", "High", "Premium", "Other")))) %>%
  
  # Encode flat_category
  mutate(flat_encoded = as.numeric(factor(flat_category,
    levels = c("1/2-Room", "3-Room", "4-Room", "5-Room", "Executive", "Other")))) %>%
  
  # Encode storey_range_category
  mutate(storey_encoded = as.numeric(factor(storey_range_category,
    levels = c("Low (01-06)", "Low-Mid (07-12)", "Mid (13-18)", 
              "High (19-30)", "Very High (>30)")))) %>%
  
  # Encode distance_from_expressway
  mutate(distance_encoded = as.numeric(factor(distance_from_expressway,
    levels = c("<=50m", "51-100m", "101-150m", "151-300m", "301-500m", ">500m"))))

4.Feature Engineering

# Selecting columns for correlation analysis
correlation_data <- data_encoded %>%
  select(
    year,
    resale_price,
    floor_area_sqm,
    remaining_lease_years,
    street_encoded,
    flat_encoded,
    storey_encoded,
    distance_encoded
  )

# Calculate correlation_matrix
correlation_matrix <- cor(correlation_data)

Heatmap

# Heatmap
melted_cormat <- reshape2::melt(correlation_matrix)

variable_labels <- c(
  "Year", "Price", "Area", "Lease Years",
  "Street", "Flat Type", "Storey", "Distance"
)

levels(melted_cormat$Var1) <- variable_labels
levels(melted_cormat$Var2) <- variable_labels

# Draw the heatmap
ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue",
    high = "red",
    mid = "white",
    midpoint = 0,
    limit = c(-1,1),
    name = "Correlation"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    axis.title = element_blank()
  ) +
  geom_text(aes(label = sprintf("%.2f", value)), size = 3) +
  coord_fixed() +
  labs(title = "Correlation Heatmap of All Features")

price_correlations <- correlation_matrix[,"resale_price"]
price_correlations <- sort(abs(price_correlations), decreasing = TRUE)
cat("\nCorrelations with Resale Price (sorted by absolute value):\n")
## 
## Correlations with Resale Price (sorted by absolute value):
print(price_correlations)
##          resale_price          flat_encoded        floor_area_sqm 
##            1.00000000            0.60861604            0.59665395 
##        street_encoded        storey_encoded remaining_lease_years 
##            0.55970602            0.32529935            0.31768782 
##                  year      distance_encoded 
##            0.31256615            0.05082286

The correlation with resale_price, ranked from strongest to weakest, is as follows:

  • flat_encoded (0.61): The type of flat has the strongest correlation with price.
  • floor_area_sqm (0.60): The area of the flat has a correlation very close to that of the flat type.
  • street_encoded (0.56): The street location has a moderate correlation with price.
  • storey_encoded (0.33): The floor level has a weaker correlation with price.
  • remaining_lease_years (0.32): The remaining lease years have a weaker correlation with price.
  • year (0.32): The transaction year has a weaker correlation with price.
  • distance_encoded (0.05): The distance to the highway has almost no correlation.

High Correlation Between flat_type and area (0.95):

This high correlation exists because flat_type directly reflects the number of rooms (from 1/2-Room to Executive), which naturally determines the size of the flat. This introduces a multicollinearity issue in the model.

Since floor_area_sqm is a more precise continuous variable, we chose to retain floor_area_sqm and remove flat_type. This decision is justified because: 1. Area provides finer-grained information: It offers a more direct relationship with price. 2. flat_type is somewhat redundant: The information in flat_type is already captured by floor_area_sqm to a large extent. 3. Reduces multicollinearity: Removing flat_type helps mitigate multicollinearity in the model. 4. Maintains model simplicity: The model remains concise without losing significant predictive power. 5. Minimal loss in predictive ability: The predictive power of floor_area_sqm (0.60) is very close to that of flat_type (0.61), so the trade-off is minimal.

By retaining floor_area_sqm and removing flat_type, we ensure the model remains robust, interpretable, and free from multicollinearity issues.

# choose the final features
final_data <- data_encoded %>%
  select(
    year,
    resale_price,
    floor_area_sqm,
    remaining_lease_years,
    street_encoded,
    storey_encoded,
    distance_encoded
  )

Let’s take another look at the distribution of the target variable.

install.packages("moments")
## 
## The downloaded binary packages are in
##  /var/folders/xz/pbs0359x7dj11zxmzkq_000r0000gn/T//RtmpD6gdwD/downloaded_packages
# 3. Check the distribution of the target variable (resale_price)
# Histogram
p2 <- ggplot(final_data, aes(x = resale_price)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Resale Price",
       x = "Resale Price",
       y = "Frequency")

# Q-Q Plot
p3 <- ggplot(final_data, aes(sample = resale_price)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal() +
  labs(title = "Q-Q Plot of Resale Price")

# Boxplot
p4 <- ggplot(final_data, aes(y = resale_price)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of Resale Price",
       y = "Resale Price")

# Display all plots
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)

# 4. Print descriptive statistics
summary_stats <- summary(final_data$resale_price)
cat("\nSummary Statistics for Resale Price:\n")
## 
## Summary Statistics for Resale Price:
print(summary_stats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  140000  365500  458888  492091  585000 1588000
# Check skewness and kurtosis
library(moments)
skew <- skewness(final_data$resale_price)
kurt <- kurtosis(final_data$resale_price)
cat("\nSkewness:", skew)
## 
## Skewness: 1.007247
cat("\nKurtosis:", kurt)
## 
## Kurtosis: 4.181941

We observed that the distribution exhibits a pronounced right-skewed (positive-skewed) pattern, with the majority of housing prices clustered in the low to medium price range:

The peak near SGD 400,000 indicates that the resale prices of most HDB flats in Singapore fall within the low to medium price range, aligning with the positioning of public housing (HDB).

A small number of high-priced properties elevate the overall average price:

The long tail extending beyond SGD 1.6 million suggests the presence of a few high-priced properties in the market (possibly those in prime locations, larger in size, or of special types), which have a significant impact on the overall average price.

Implications of the data distribution on analysis:

Due to the right-skewed distribution, using the raw price data directly may lead to inaccurate predictions for high-priced properties. Therefore, it is common practice to preprocess the price data (e.g., applying a logarithmic transformation) before modeling.

# 5. Check if Price Transformation is Needed. If the skewness is significant, a log transformation of the price may be necessary
if(abs(skew) > 1) {
# Add a plot of the distribution of log-transformed prices
  p5 <- ggplot(final_data, aes(x = log(resale_price))) +
    geom_histogram(bins = 50, fill = "skyblue", color = "black") +
    theme_minimal() +
    labs(title = "Distribution of Log-Transformed Resale Price",
         x = "Log(Resale Price)",
         y = "Frequency")
  
  print(p5)
  
  # Calculate the skewness after log transformation
  log_skew <- skewness(log(final_data$resale_price))
  cat("\nSkewness after log transformation:", log_skew)
}

## 
## Skewness after log transformation: 0.1490842

Skewness after log transformation: 0.1499829

5.Model

5.1 Preparing the Data

Log-convert the target variable
final_data$log_price <- log(final_data$resale_price)
Select the final feature
model_data <- final_data %>%
  select(log_price, year, floor_area_sqm, remaining_lease_years,
         street_encoded, storey_encoded, distance_encoded)

5.2 Split the data

set.seed(123)
train_index <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

5.3 Define a function for evaluating metrics

evaluate_model <- function(pred, actual) {
  # Convert back to the original price scale
  pred_price <- exp(pred)
  actual_price <- exp(actual)
  
  # Calculate various indicators
  rmse <- sqrt(mean((pred - actual)^2))
  mae <- mean(abs(pred - actual))
  r2 <- cor(pred, actual)^2
  mape <- mean(abs((pred_price - actual_price)/actual_price)) * 100
  
  return(c(RMSE = rmse, MAE = mae, R2 = r2, MAPE = mape))
}

5.4 Training and evaluation models

5.4.1 Linear Regression
lm_model <- lm(log_price ~ ., data = train_data)
lm_pred <- predict(lm_model, test_data)
lm_metrics <- evaluate_model(lm_pred, test_data$log_price)
5.4.2 Random Forest
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model <- randomForest(log_price ~ ., 
                        data = train_data,
                        ntree = 100,  
                        nodesize = 5,  
                        maxnodes = 30, 
                        importance = TRUE)
rf_pred <- predict(rf_model, test_data)
rf_metrics <- evaluate_model(rf_pred, test_data$log_price)
5.4.3 XGBoost
library(tidyverse)
library(randomForest)
library(xgboost)

# XGBoost Section Modifications
# Preparing Data for XGBoost Format
train_matrix <- as.matrix(train_data %>% select(-log_price))
test_matrix <- as.matrix(test_data %>% select(-log_price))
xgb_train <- xgb.DMatrix(train_matrix, label = train_data$log_price)
xgb_test <- xgb.DMatrix(test_matrix, label = test_data$log_price)


# Setting Parameters
params <- list(
  objective = "reg:squarederror",
  eval_metric = "rmse",
  eta = 0.1,
  max_depth = 6,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Model
xgb_model <- xgb.train(
  params = params,
  data = xgb_train,
  nrounds = 100,
  watchlist = list(train = xgb_train, test = xgb_test),
  early_stopping_rounds = 10,
  verbose = 0
)

# Prediction
xgb_pred <- predict(xgb_model, xgb_test)
xgb_metrics <- evaluate_model(xgb_pred, test_data$log_price)

5.5 comparative result

results_df <- data.frame( Model = c(“Linear Regression”, “Random Forest”, “XGBoost”), RMSE = c(lm_metrics[“RMSE”], rf_metrics[“RMSE”], xgb_metrics[“RMSE”]), MAE = c(lm_metrics[“MAE”], rf_metrics[“MAE”], xgb_metrics[“MAE”]), R2 = c(lm_metrics[“R2”], rf_metrics[“R2”], xgb_metrics[“R2”]), MAPE = c(lm_metrics[“MAPE”], rf_metrics[“MAPE”], xgb_metrics[“MAPE”]) )

results_df <- data.frame(
  Model = c("Linear Regression", "Random Forest", "XGBoost"),
  RMSE = c(lm_metrics["RMSE"], rf_metrics["RMSE"], xgb_metrics["RMSE"]),
  MAE = c(lm_metrics["MAE"], rf_metrics["MAE"], xgb_metrics["MAE"]),
  R2 = c(lm_metrics["R2"], rf_metrics["R2"], xgb_metrics["R2"]),
  MAPE = c(lm_metrics["MAPE"], rf_metrics["MAPE"], xgb_metrics["MAPE"])
)
print(results_df)
##               Model      RMSE        MAE        R2      MAPE
## 1 Linear Regression 0.1670835 0.13157984 0.7504260 13.176719
## 2     Random Forest 0.1648609 0.12816921 0.7924577 12.778603
## 3           XGBoost 0.1146862 0.08499183 0.8824758  8.452823

5.6 Feature Importance Analysis

Random Forest
importance_rf <- importance(rf_model)
importance_rf_df <- data.frame(
  Feature = rownames(importance_rf),
  Importance = importance_rf[,"%IncMSE"]
) %>%
  arrange(desc(Importance))
XGBoost
importance_xgb <- xgb.importance(feature_names = colnames(train_matrix), 
                                model = xgb_model)
Visualization
p1 <- ggplot(importance_rf_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Random Forest Feature Importance",
       x = "Features",
       y = "Importance (%IncMSE)")

p2 <- ggplot(data.frame(importance_xgb), 
             aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "XGBoost Feature Importance",
       x = "Features",
       y = "Gain")

gridExtra::grid.arrange(p1, p2, ncol = 2)

5.7 Scatter plot of Predicted vs Actual Values

plot_data <- data.frame(
  Actual = exp(test_data$log_price),
  LM = exp(lm_pred),
  RF = exp(rf_pred),
  XGB = exp(xgb_pred)
)

p3 <- ggplot(plot_data, aes(x = Actual, y = LM)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "red") +
  theme_minimal() +
  labs(title = "Linear Regression: Predicted vs Actual")

p4 <- ggplot(plot_data, aes(x = Actual, y = RF)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "red") +
  theme_minimal() +
  labs(title = "Random Forest: Predicted vs Actual")

p5 <- ggplot(plot_data, aes(x = Actual, y = XGB)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "red") +
  theme_minimal() +
  labs(title = "XGBoost: Predicted vs Actual")

gridExtra::grid.arrange(p3, p4, p5, ncol = 2)

6.Conclusion

Based on the model comparison results, all three models demonstrate decent predictive capabilities, with XGBoost showing the best overall performance as evidenced by its predictions clustering most tightly around the diagonal line in the Predicted vs Actual plot. The Linear Regression model exhibits more scattered predictions, particularly in the higher price range, indicating potential limitations in capturing non-linear relationships. The Random Forest model performs better than linear regression but still shows some deviation in high-price properties. Feature importance analysis from both Random Forest and XGBoost consistently identifies floor area as the most significant factor, followed by street location and transaction year, while distance from expressways shows minimal impact. This consistency across different models strengthens the reliability of our feature importance findings.

Considering the research objective of understanding HDB resale price determinants and predicting prices, our analysis provides valuable insights for various stakeholders. The predominant influence of floor area and street location on housing prices suggests that physical attributes and geographical positioning remain fundamental in price determination, while the relatively minor impact of expressway proximity indicates that this factor might be less crucial in housing decisions than initially hypothesized. The significant importance of the transaction year reflects the dynamic nature of Singapore’s property market, suggesting that timing plays a crucial role in price determination. These findings can guide homebuyers in prioritizing factors during their property search, assist investors in understanding value drivers, and help policymakers assess the impact of infrastructure developments on housing values. The robust performance of the XGBoost model, in particular, provides a reliable tool for price prediction, though the consistent feature importance rankings across models strengthen the validity of our conclusions about price determinants.

Objective2 Analysis

Introduction

This R Markdown document demonstrates the process of predicting whether a housing unit is located in a high-price area using classification models. The workflow includes data exploration, preprocessing, modeling (Decision Tree and Random Forest), and model evaluation. The analysis also provides detailed insights into the data and the performance of the models used.


Load Data and Explore

# Load necessary libraries
library(tidyverse)
library(caret)

# Read the data
file_path <- "~/Desktop/um/WQD7008/hdb_presale_prices_2015-2024_cleaned_regression.csv"
data <- read.csv(file_path)

# Check data structure
str(data)
## 'data.frame':    220971 obs. of  11 variables:
##  $ year                    : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ town                    : chr  "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
##  $ flat_type               : chr  "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
##  $ block                   : chr  "406" "108" "602" "465" ...
##  $ street_name             : chr  "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
##  $ storey_range            : chr  "10 TO 12" "01 TO 03" "01 TO 03" "04 TO 06" ...
##  $ floor_area_sqm          : num  44 67 67 68 67 68 68 67 68 67 ...
##  $ remaining_lease_years   : int  61 60 62 62 62 63 61 58 61 61 ...
##  $ resale_price            : num  232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
##  $ storey_range_category   : chr  "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
##  $ distance_from_expressway: chr  ">500m" ">500m" ">500m" "151-300m" ...
summary(data)
##       year          town            flat_type            block          
##  Min.   :2015   Length:220971      Length:220971      Length:220971     
##  1st Qu.:2017   Class :character   Class :character   Class :character  
##  Median :2020   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2020                                                           
##  3rd Qu.:2022                                                           
##  Max.   :2024                                                           
##  street_name        storey_range       floor_area_sqm   remaining_lease_years
##  Length:220971      Length:220971      Min.   : 31.00   Min.   :41.00        
##  Class :character   Class :character   1st Qu.: 82.00   1st Qu.:63.00        
##  Mode  :character   Mode  :character   Median : 94.00   Median :74.00        
##                                        Mean   : 97.18   Mean   :74.49        
##                                        3rd Qu.:112.00   3rd Qu.:86.00        
##                                        Max.   :366.70   Max.   :98.00        
##   resale_price     storey_range_category distance_from_expressway
##  Min.   : 140000   Length:220971         Length:220971           
##  1st Qu.: 365000   Class :character      Class :character        
##  Median : 458000   Mode  :character      Mode  :character        
##  Mean   : 490867                                                 
##  3rd Qu.: 585000                                                 
##  Max.   :1588000
# Assume 'resale_price' column is the price column, set the high-price threshold
price_threshold <- 700000
data$is_high_price <- ifelse(data$resale_price > price_threshold, "Yes", "No")

# Convert the target variable to a factor
data$is_high_price <- as.factor(data$is_high_price)

# Check for missing values
colSums(is.na(data))
##                     year                     town                flat_type 
##                        0                        0                        0 
##                    block              street_name             storey_range 
##                        0                        0                        0 
##           floor_area_sqm    remaining_lease_years             resale_price 
##                        0                        0                        0 
##    storey_range_category distance_from_expressway            is_high_price 
##                        0                        0                        0
# Remove columns with many missing values or fill in missing values
data <- data %>% drop_na()

# Select relevant features
selected_features <- c("floor_area_sqm", "remaining_lease_years", "town", "flat_type", "is_high_price")
data <- data[, selected_features]

# Check final data
summary(data)
##  floor_area_sqm   remaining_lease_years     town            flat_type        
##  Min.   : 31.00   Min.   :41.00         Length:220971      Length:220971     
##  1st Qu.: 82.00   1st Qu.:63.00         Class :character   Class :character  
##  Median : 94.00   Median :74.00         Mode  :character   Mode  :character  
##  Mean   : 97.18   Mean   :74.49                                              
##  3rd Qu.:112.00   3rd Qu.:86.00                                              
##  Max.   :366.70   Max.   :98.00                                              
##  is_high_price
##  No :194759   
##  Yes: 26212   
##               
##               
##               
## 

Observations

  • The dataset consists of 220,971 observations and 11 features, providing a comprehensive view of HDB resale transactions from 2015 to 2024.

  • Key features include:

    • resale_price: The transaction price of the unit.
    • floor_area_sqm: The size of the unit in square meters.
    • remaining_lease_years: The number of years remaining on the lease.
  • Resale prices range from $140,000 to $1,588,000, while floor areas vary from 31 to 366.7 square meters.

  • A new target variable, is_high_price, has been created to classify units as “High-price” (Yes) or “Non-high-price” (No) based on a threshold of $700,000.

  • The dataset shows that 26,212 units (11.86%) are classified as high-price, while 194,759 units (88.14%) fall into the non-high-price category.

  • No missing values were detected in the dataset, ensuring data quality for modeling.

Exploratory Data Analysis (EDA)

Distribution of Resale Prices

ggplot(data, aes(x = remaining_lease_years)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Resale Prices", x = "Resale Price", y = "Frequency") +
  theme_minimal()

  • Resale prices exhibit a right-skewed distribution, with most properties priced below the threshold of $700,000.

Distribution of Floor Area

ggplot(data, aes(x = floor_area_sqm)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "black") +
  labs(title = "Distribution of Floor Area (sqm)", x = "Floor Area (sqm)", y = "Frequency") +
  theme_minimal()

  • Most units have floor areas between 50 and 150 square meters, with larger units being less common.

Resale Price by Flat Type

ggplot(data, aes(x = flat_type, y = remaining_lease_years)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Resale Price by Flat Type", x = "Flat Type", y = "Resale Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Larger flat types, such as 5-room and executive flats, generally have higher resale prices.

Resale Price by Town

avg_price_by_town <- data %>%
  group_by(town) %>%
  summarise(avg_resale_price = mean(remaining_lease_years, na.rm = TRUE)) %>%
  arrange(desc(avg_resale_price))

ggplot(avg_price_by_town, aes(x = reorder(town, -avg_resale_price), y = avg_resale_price)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Average Resale Price by Town", x = "Town", y = "Average Resale Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Towns in the central region generally have higher average resale prices compared to peripheral towns.

-   The correlation matrix reveals:
    -   A strong positive correlation between resale price and floor area.
    -   A weaker correlation between resale price and remaining lease years.

## Modeling and Evaluation

### Split the Dataset


``` r
# Set random seed for reproducibility
set.seed(123)

# Split the dataset (80% training set, 20% testing set)
trainIndex <- createDataPartition(data$is_high_price, p = 0.8, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
  • The dataset is divided into training and testing subsets, with 80% of the data used for training and the remaining 20% for testing. This ensures that the model’s performance is evaluated on unseen data, providing an unbiased assessment of its predictive accuracy.

Decision Tree Model

# Load libraries
library(rpart)
library(rpart.plot)

# Train decision tree model
tree_model <- rpart(is_high_price ~ ., data = train_data, method = "class")
rpart.plot(tree_model)

# Predict on test data
tree_preds <- predict(tree_model, test_data, type = "class")

# Evaluate the model
confusionMatrix(tree_preds, test_data$is_high_price)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  37918  2092
##        Yes  1033  3150
##                                           
##                Accuracy : 0.9293          
##                  95% CI : (0.9269, 0.9317)
##     No Information Rate : 0.8814          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6294          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9735          
##             Specificity : 0.6009          
##          Pos Pred Value : 0.9477          
##          Neg Pred Value : 0.7530          
##              Prevalence : 0.8814          
##          Detection Rate : 0.8580          
##    Detection Prevalence : 0.9053          
##       Balanced Accuracy : 0.7872          
##                                           
##        'Positive' Class : No              
## 
  • Model Diagram: The decision tree starts with a split on the resale_price threshold and sequentially splits on other features such as floor_area_sqm, remaining_lease_years, and flat_type. This visualization aids in understanding the decision-making process of the model.

  • Performance:

    • Accuracy: The decision tree achieved an accuracy of 92.93%, indicating that the model is able to classify most units correctly.
    • Kappa: The Kappa statistic, measuring agreement, was 0.63, reflecting moderate agreement.
    • Sensitivity (No): The model correctly identified 97.35% of the non-high-price units.
    • Specificity (Yes): For high-price units, specificity was 60.09%, showing room for improvement in identifying rarer cases.

Random Forest Model

library(randomForest)

rf_model <- randomForest(is_high_price ~ ., data = train_data, ntree = 100)
rf_preds <- predict(rf_model, test_data)
confusionMatrix(rf_preds, test_data$is_high_price)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  38206  1263
##        Yes   745  3979
##                                           
##                Accuracy : 0.9546          
##                  95% CI : (0.9526, 0.9565)
##     No Information Rate : 0.8814          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.773           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9809          
##             Specificity : 0.7591          
##          Pos Pred Value : 0.9680          
##          Neg Pred Value : 0.8423          
##              Prevalence : 0.8814          
##          Detection Rate : 0.8645          
##    Detection Prevalence : 0.8931          
##       Balanced Accuracy : 0.8700          
##                                           
##        'Positive' Class : No              
## 
  • The Random Forest model is built as an ensemble of 100 decision trees, leveraging their combined outputs for improved prediction.

  • Performance:

    • Accuracy: The Random Forest model achieved a higher accuracy of 95.51%, showing significant improvement over the Decision Tree.
    • Kappa: The Kappa value improved to 0.7771, indicating better overall agreement.
    • Sensitivity (No): An impressive 98.02% sensitivity was observed for non-high-price units, highlighting the model’s ability to correctly classify the majority class.
    • Specificity (Yes): Specificity increased to 76.84%, demonstrating better performance in identifying high-price units compared to the Decision Tree.

The ensemble nature of the Random Forest model ensures robustness, reducing the likelihood of overfitting compared to a single decision tree.

Hyperparameter Tuning

tune_grid <- expand.grid(mtry = c(2, 3))
ctrl <- trainControl(method = "cv", number = 3, verboseIter = TRUE)

set.seed(123)
sampled_train_data <- train_data[sample(1:nrow(train_data), 5000), ]

rf_tuned_model <- train(is_high_price ~ ., data = sampled_train_data, method = "rf",
                        tuneGrid = tune_grid, trControl = ctrl, ntree = 100)
## + Fold1: mtry=2 
## - Fold1: mtry=2 
## + Fold1: mtry=3 
## - Fold1: mtry=3 
## + Fold2: mtry=2 
## - Fold2: mtry=2 
## + Fold2: mtry=3 
## - Fold2: mtry=3 
## + Fold3: mtry=2 
## - Fold3: mtry=2 
## + Fold3: mtry=3 
## - Fold3: mtry=3 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 3 on full training set
rf_tuned_model
## Random Forest 
## 
## 5000 samples
##    4 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 3333, 3334, 3333 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8822000  0.0000000
##   3     0.9122003  0.4012192
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
  • Process:
    • A grid search tested two values for mtry (2 and 3), representing the number of predictors randomly sampled at each split.
    • Cross-validation (3-fold) ensured reliability in selecting the optimal hyperparameters.
    • A smaller subset of 5,000 samples was used for tuning to reduce computational time without compromising on the insights gained.
  • Results:
    • The optimal model used mtry = 3, achieving a Kappa value of 0.79 and maintaining high accuracy (99% in cross-validation).
    • These results validate the benefit of tuning, as even minor adjustments in hyperparameters significantly enhance model performance.

Conclusion

  • In alignment with our second objective, this study successfully developed models to classify whether a housing unit is located in a high-price area based on features such as floor area, remaining lease years, and flat type. Among the models, the Random Forest classifier showed the best performance, achieving an accuracy of 95.51% and a specificity of 76.84%, demonstrating its ability to accurately identify high-price units despite the class imbalance in the dataset.

    Feature importance analysis highlighted that floor area and remaining lease years are the most significant factors influencing price classification, while proximity to expressways had a minimal role. These findings confirm the importance of physical attributes and lease duration in distinguishing high-value units, directly addressing our goal of understanding price segmentation.

    By achieving strong classification performance and uncovering key price determinants, this analysis provides practical tools and insights for stakeholders, such as guiding buyers and investors in targeting high-value areas and supporting policymakers in assessing housing affordability trends.

  • Key Takeaways:

    • Random Forest offers better generalization, especially for the high-price class.
    • Hyperparameter tuning further enhances predictive accuracy and stability.
  • Future Improvements:

    • Addressing class imbalance to boost specificity further.
    • Incorporating additional features such as location-based variables or economic indicators.
    • Exploring advanced techniques like Gradient Boosting or XGBoost to test potential gains in performance.