Group 6
Members:
TIAN TIANCHU [24056080]
YAO YUANWEI [24056483]
LI JINZHAO [23108894]
JIANG TAO [23095898]
ZHANG SHUBO [23120191]
Forum discussions suggest mixed views on the impact of expressway proximity on HDB resale prices—some cite noise pollution as a drawback, while others highlight the benefits of unobstructed views for higher floors. To explore the potential influence of expressways and other factors, this project investigates two aspects of the HDB resale market: predicting the resale prices and classifying whether an HDB unit is located in a high-price area.
The dataset consists of information about the resale prices of Singapore public housing (HDB) flats or apartments from the years 2015 to 2024. The dataset consists of 220,971 rows, each representing 1 transaction.There are 11 columns:
resale_price: transaction price (in Singapore dollars)
year: transaction year (range from 2015 to 2024)
town: name of town where the flat is located
flat_type: type of flat (e.g. 1-room, 2-room, 5-room, executive)
block: building number
street_name: street address
storey_range: the floor on which the apartment is located
floor_area_sqm: size of apartment in square meters
remaining_lease_years: number of years remaining in the lease (note: all Singapore HDB flats have a 99-year lease)
storey_range_category: categorical variable created based on “storey_range” (Low (01-06); Low-Mid (07-12); Mid (13-18); High (19-30); Very High (>30))
distance_from_expressway: distance from the building to the nearest expressway (<=50m, 51-100m, 101-150m, 151-300m, 301-500m, >500m)
Source:https://www.kaggle.com/datasets/adisongoh/resale-flat-prices-with-distance-from-expressway/data
Objective 1: To predict the resale prices of HDB flats. (using regression methods)
Objective 2: To classify whether an HDB unit is located in a high-price area. (using classification methods)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(fastDummies)
library(caret)
## Loading required package: lattice
library(readr)
data <- read_csv("~/Desktop/um/WQD7008/hdb_presale_prices_2015-2024_cleaned_regression.csv")
## Rows: 220971 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): town, flat_type, block, street_name, storey_range, storey_range_cat...
## dbl (4): year, floor_area_sqm, remaining_lease_years, resale_price
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat("\nInitial data structure:\n")
##
## Initial data structure:
str(data)
## spc_tbl_ [220,971 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:220971] 2017 2017 2017 2017 2017 ...
## $ town : chr [1:220971] "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
## $ flat_type : chr [1:220971] "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
## $ block : chr [1:220971] "406" "108" "602" "465" ...
## $ street_name : chr [1:220971] "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
## $ storey_range : chr [1:220971] "10 TO 12" "01 TO 03" "01 TO 03" "04 TO 06" ...
## $ floor_area_sqm : num [1:220971] 44 67 67 68 67 68 68 67 68 67 ...
## $ remaining_lease_years : num [1:220971] 61 60 62 62 62 63 61 58 61 61 ...
## $ resale_price : num [1:220971] 232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
## $ storey_range_category : chr [1:220971] "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
## $ distance_from_expressway: chr [1:220971] ">500m" ">500m" ">500m" "151-300m" ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. town = col_character(),
## .. flat_type = col_character(),
## .. block = col_character(),
## .. street_name = col_character(),
## .. storey_range = col_character(),
## .. floor_area_sqm = col_double(),
## .. remaining_lease_years = col_double(),
## .. resale_price = col_double(),
## .. storey_range_category = col_character(),
## .. distance_from_expressway = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
cat("\nInitial data preview:\n")
##
## Initial data preview:
head(data)
## # A tibble: 6 × 11
## year town flat_type block street_name storey_range floor_area_sqm
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2017 ANG MO KIO 2 ROOM 406 ANG MO KIO AVENU… 10 TO 12 44
## 2 2017 ANG MO KIO 3 ROOM 108 ANG MO KIO AVENU… 01 TO 03 67
## 3 2017 ANG MO KIO 3 ROOM 602 ANG MO KIO AVENU… 01 TO 03 67
## 4 2017 ANG MO KIO 3 ROOM 465 ANG MO KIO AVENU… 04 TO 06 68
## 5 2017 ANG MO KIO 3 ROOM 601 ANG MO KIO AVENU… 01 TO 03 67
## 6 2017 ANG MO KIO 3 ROOM 150 ANG MO KIO AVENU… 01 TO 03 68
## # ℹ 4 more variables: remaining_lease_years <dbl>, resale_price <dbl>,
## # storey_range_category <chr>, distance_from_expressway <chr>
Based on the description of the dataset, storey_range_category is a categorical variable created based on “storey_range” (Low (01-06); Low-Mid (07-12); Mid (13-18); High (19-30); Very High (>30)). Therefore, we observe that storey_range_category and storey_range are highly correlated. Here, we remove the “storey_range” column. And we remove the column of “block”.
# remove the "storey_range"and"block" column
data <- data %>%
select(-storey_range)
data <- data %>%
select(-block)
# check the remaining columns.
print(names(data))
## [1] "year" "town"
## [3] "flat_type" "street_name"
## [5] "floor_area_sqm" "remaining_lease_years"
## [7] "resale_price" "storey_range_category"
## [9] "distance_from_expressway"
missing_summary <- colSums(is.na(data))
cat("\nMissing values in each column:\n")
##
## Missing values in each column:
print(missing_summary)
## year town flat_type
## 0 0 0
## street_name floor_area_sqm remaining_lease_years
## 0 0 0
## resale_price storey_range_category distance_from_expressway
## 0 0 0
Upon examination, we discovered that there are no missing values in the dataset.
# Original number of rows
cat("\nOriginal number of rows:", nrow(data), "\n")
##
## Original number of rows: 220971
# Check for duplicate rows
duplicated_rows <- duplicated(data)
cat("\nNumber of duplicate rows before cleaning:", sum(duplicated_rows), "\n")
##
## Number of duplicate rows before cleaning: 10305
# Remove duplicates
data <- data[!duplicated(data), ]
# Check the number of rows after cleaning
cat("\nNumber of rows after cleaning:", nrow(data), "\n")
##
## Number of rows after cleaning: 210666
Then, we discovered that there are approximately 10,000 duplicate values. However, since the dataset contains over 200,000 entries, we decided to directly remove these duplicate values.
# check the remaining columns.
print(names(data))
## [1] "year" "town"
## [3] "flat_type" "street_name"
## [5] "floor_area_sqm" "remaining_lease_years"
## [7] "resale_price" "storey_range_category"
## [9] "distance_from_expressway"
str(data)
## tibble [210,666 × 9] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:210666] 2017 2017 2017 2017 2017 ...
## $ town : chr [1:210666] "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
## $ flat_type : chr [1:210666] "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
## $ street_name : chr [1:210666] "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
## $ floor_area_sqm : num [1:210666] 44 67 67 68 67 68 68 67 68 67 ...
## $ remaining_lease_years : num [1:210666] 61 60 62 62 62 63 61 58 61 61 ...
## $ resale_price : num [1:210666] 232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
## $ storey_range_category : chr [1:210666] "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
## $ distance_from_expressway: chr [1:210666] ">500m" ">500m" ">500m" "151-300m" ...
Through the analysis of the dataset, we can observe that there are currently 4 numerical variables and 5 character variables. We are now focusing on the three variables ‘town’, ‘street_name’, and ‘flat_type’, as they contain numerous values and have not been categorized into intervals.
# 1. Descriptive Statistical Analysis.
categorical_summary <- function(data, cat_var) {
data %>%
group_by(!!sym(cat_var)) %>%
summarise(
count = n(),
mean_price = mean(resale_price),
median_price = median(resale_price),
sd_price = sd(resale_price),
cv = sd_price / mean_price * 100,
percentage = n() / nrow(data) * 100
) %>%
arrange(desc(mean_price))
}
#Analyze each categorical variable
town_stats <- categorical_summary(data, "town")
flat_type_stats <- categorical_summary(data, "flat_type")
street_stats <- categorical_summary(data, "street_name")
# Print the results.
cat("\nTown Statistics:\n")
##
## Town Statistics:
print(town_stats)
## # A tibble: 26 × 7
## town count mean_price median_price sd_price cv percentage
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BUKIT TIMAH 548 743519. 727500 230780. 31.0 0.260
## 2 BISHAN 3954 671918. 653661. 204067. 30.4 1.88
## 3 CENTRAL AREA 1860 664782. 555000 282268. 42.5 0.883
## 4 QUEENSTOWN 5810 604569. 620000 243409. 40.3 2.76
## 5 BUKIT MERAH 8224 603341. 618000 228757. 37.9 3.90
## 6 KALLANG/WHAMPOA 6507 554265. 528888 228108. 41.2 3.09
## 7 MARINE PARADE 1326 549099. 484444 192353. 35.0 0.629
## 8 PASIR RIS 6472 545630. 525000 129049. 23.7 3.07
## 9 SERANGOON 4071 529584. 495000 178833. 33.8 1.93
## 10 CLEMENTI 4882 518637. 453000 208916. 40.3 2.32
## # ℹ 16 more rows
cat("\nFlat Type Statistics:\n")
##
## Flat Type Statistics:
print(flat_type_stats)
## # A tibble: 7 × 7
## flat_type count mean_price median_price sd_price cv percentage
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MULTI-GENERATION 81 836419. 830000 113481. 13.6 0.0384
## 2 EXECUTIVE 16259 689728. 668000 144760. 21.0 7.72
## 3 5 ROOM 52589 586049. 555000 160131. 27.3 25.0
## 4 4 ROOM 87603 490296. 460000 140712. 28.7 41.6
## 5 3 ROOM 50533 348764. 330000 90494. 25.9 24.0
## 6 2 ROOM 3530 276219. 267000 55423. 20.1 1.68
## 7 1 ROOM 71 204650. 205000 26500. 12.9 0.0337
cat("\nStreet Statistics:\n")
##
## Street Statistics:
print(street_stats)
## # A tibble: 566 × 7
## street_name count mean_price median_price sd_price cv percentage
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CANTONMENT ROAD 637 1008304. 975000 147393. 14.6 0.302
## 2 LORONG 1A TOA PAYOH 458 921531. 902500 178297. 19.3 0.217
## 3 HENDERSON ROAD 119 906997. 860000 242163. 26.7 0.0565
## 4 BOON TIONG ROAD 353 897149. 910000 149705. 16.7 0.168
## 5 JALAN MA'MOR 38 874126. 820000 179176. 20.5 0.0180
## 6 ZION ROAD 20 870194. 850000 83127. 9.55 0.00949
## 7 ANG MO KIO STREET 21 92 868433. 862500 140521. 16.2 0.0437
## 8 DAWSON ROAD 544 862212. 855000 141766. 16.4 0.258
## 9 ANG MO KIO STREET 51 91 861692. 910000 126123. 14.6 0.0432
## 10 REDHILL LANE 53 860543. 858000 96358. 11.2 0.0252
## # ℹ 556 more rows
# Create box plots for each categorical variable.
p1 <- ggplot(data, aes(x = town, y = resale_price)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Resale Price by Town",
x = "Town",
y = "Resale Price")
p2 <- ggplot(data, aes(x = flat_type, y = resale_price)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Resale Price by Flat Type",
x = "Flat Type",
y = "Resale Price")
top_10_streets <- data %>%
group_by(street_name) %>%
summarise(median_price = median(resale_price)) %>%
arrange(desc(median_price)) %>%
head(10)
p3 <- data %>%
filter(street_name %in% top_10_streets$street_name) %>%
ggplot(aes(x = reorder(street_name, -resale_price), y = resale_price)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Top 10 Streets by Median Resale Price",
x = "Street",
y = "Resale Price")
p1
p2
p3
# 3. Statistical Test - ANOVA Analysis
anova_analysis <- function(data, cat_var) {
formula <- as.formula(paste("resale_price ~", cat_var))
model <- aov(formula, data = data)
# Get results
anova_result <- summary(model)
# Calculate effect size (Eta-squared)
ss_total <- sum((data$resale_price - mean(data$resale_price))^2)
eta_squared <- anova_result[[1]]$"Sum Sq"[1] / ss_total
# Organize results
result <- list(
variable = cat_var,
f_value = anova_result[[1]]$"F value"[1],
p_value = anova_result[[1]]$"Pr(>F)"[1],
eta_squared = eta_squared
)
return(result)
}
# Perform ANOVA analysis for each categorical variable
town_anova <- anova_analysis(data, "town")
flat_type_anova <- anova_analysis(data, "flat_type")
street_anova <- anova_analysis(data, "street_name")
# Organize ANOVA results
anova_results <- data.frame(
Variable = c("Town", "Flat Type", "Street"),
F_Value = c(town_anova$f_value, flat_type_anova$f_value, street_anova$f_value),
P_Value = c(town_anova$p_value, flat_type_anova$p_value, street_anova$p_value),
Eta_Squared = c(town_anova$eta_squared, flat_type_anova$eta_squared, street_anova$eta_squared)
)
# Print ANOVA results
cat("\nANOVA Analysis Results:\n")
##
## ANOVA Analysis Results:
print(anova_results)
## Variable F_Value P_Value Eta_Squared
## 1 Town 1068.0866 0 0.1125049
## 2 Flat Type 21139.1761 0 0.3758142
## 3 Street 253.9968 0 0.4058395
Since the effect size of Town is relatively small and the information provided by Town may be partially captured by Street, reducing the number of features helps improve the stability of the model. Therefore, we have decided to remove the Town feature.
# 1. Analyze and reclassify Street (effect size: 0.4058)
street_analysis <- data %>%
group_by(street_name) %>%
summarise(
avg_price = mean(resale_price),
count = n()
) %>%
filter(count >= 10) %>%
arrange(desc(avg_price)) %>%
# Use quartiles for classification.
mutate(
price_quantile = ntile(avg_price, 4),
street_category = case_when(
price_quantile == 4 ~ "Premium",
price_quantile == 3 ~ "High",
price_quantile == 2 ~ "Medium",
price_quantile == 1 ~ "Basic"
)
)
# 2. Analyze and reclassify Flat Type (effect size: 0.3758)
data_new <- data %>%
# Merge Street classifications.
left_join(
street_analysis %>% select(street_name, street_category),
by = "street_name"
) %>%
# Label unmatched streets (sample size < 10) as "Other".
mutate(street_category = ifelse(is.na(street_category), "Other", street_category)) %>%
# Reclassify Flat Type (based on the number of rooms and type).
mutate(
flat_category = case_when(
grepl("EXECUTIVE", flat_type) ~ "Executive",
grepl("5 ROOM", flat_type) ~ "5-Room",
grepl("4 ROOM", flat_type) ~ "4-Room",
grepl("3 ROOM", flat_type) ~ "3-Room",
grepl("2 ROOM|1 ROOM", flat_type) ~ "1/2-Room",
TRUE ~ "Other"
)
)
# Verify the effectiveness of the new classification
# Create box plots
p1 <- ggplot(data_new, aes(x = street_category, y = resale_price)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Resale Price by Street Category",
x = "Street Category",
y = "Resale Price")
p2 <- ggplot(data_new, aes(x = flat_category, y = resale_price)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Resale Price by Flat Category",
x = "Flat Category",
y = "Resale Price")
# Display the plot.
gridExtra::grid.arrange(p1, p2, ncol = 2)
Because it is a sequential variable, we choose the Label encoding here.
# Perform encoding and prepare for correlation analysis
data_encoded <- data_new %>%
# Encode street_category
mutate(street_encoded = as.numeric(factor(street_category,
levels = c("Basic", "Medium", "High", "Premium", "Other")))) %>%
# Encode flat_category
mutate(flat_encoded = as.numeric(factor(flat_category,
levels = c("1/2-Room", "3-Room", "4-Room", "5-Room", "Executive", "Other")))) %>%
# Encode storey_range_category
mutate(storey_encoded = as.numeric(factor(storey_range_category,
levels = c("Low (01-06)", "Low-Mid (07-12)", "Mid (13-18)",
"High (19-30)", "Very High (>30)")))) %>%
# Encode distance_from_expressway
mutate(distance_encoded = as.numeric(factor(distance_from_expressway,
levels = c("<=50m", "51-100m", "101-150m", "151-300m", "301-500m", ">500m"))))
# Selecting columns for correlation analysis
correlation_data <- data_encoded %>%
select(
year,
resale_price,
floor_area_sqm,
remaining_lease_years,
street_encoded,
flat_encoded,
storey_encoded,
distance_encoded
)
# Calculate correlation_matrix
correlation_matrix <- cor(correlation_data)
# Heatmap
melted_cormat <- reshape2::melt(correlation_matrix)
variable_labels <- c(
"Year", "Price", "Area", "Lease Years",
"Street", "Flat Type", "Storey", "Distance"
)
levels(melted_cormat$Var1) <- variable_labels
levels(melted_cormat$Var2) <- variable_labels
# Draw the heatmap
ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1,1),
name = "Correlation"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title = element_blank()
) +
geom_text(aes(label = sprintf("%.2f", value)), size = 3) +
coord_fixed() +
labs(title = "Correlation Heatmap of All Features")
price_correlations <- correlation_matrix[,"resale_price"]
price_correlations <- sort(abs(price_correlations), decreasing = TRUE)
cat("\nCorrelations with Resale Price (sorted by absolute value):\n")
##
## Correlations with Resale Price (sorted by absolute value):
print(price_correlations)
## resale_price flat_encoded floor_area_sqm
## 1.00000000 0.60861604 0.59665395
## street_encoded storey_encoded remaining_lease_years
## 0.55970602 0.32529935 0.31768782
## year distance_encoded
## 0.31256615 0.05082286
The correlation with resale_price, ranked from strongest to weakest, is as follows:
High Correlation Between flat_type and area (0.95):
This high correlation exists because flat_type directly reflects the number of rooms (from 1/2-Room to Executive), which naturally determines the size of the flat. This introduces a multicollinearity issue in the model.
Since floor_area_sqm is a more precise continuous variable, we chose to retain floor_area_sqm and remove flat_type. This decision is justified because: 1. Area provides finer-grained information: It offers a more direct relationship with price. 2. flat_type is somewhat redundant: The information in flat_type is already captured by floor_area_sqm to a large extent. 3. Reduces multicollinearity: Removing flat_type helps mitigate multicollinearity in the model. 4. Maintains model simplicity: The model remains concise without losing significant predictive power. 5. Minimal loss in predictive ability: The predictive power of floor_area_sqm (0.60) is very close to that of flat_type (0.61), so the trade-off is minimal.
By retaining floor_area_sqm and removing flat_type, we ensure the model remains robust, interpretable, and free from multicollinearity issues.
# choose the final features
final_data <- data_encoded %>%
select(
year,
resale_price,
floor_area_sqm,
remaining_lease_years,
street_encoded,
storey_encoded,
distance_encoded
)
Let’s take another look at the distribution of the target variable.
install.packages("moments")
##
## The downloaded binary packages are in
## /var/folders/xz/pbs0359x7dj11zxmzkq_000r0000gn/T//RtmpD6gdwD/downloaded_packages
# 3. Check the distribution of the target variable (resale_price)
# Histogram
p2 <- ggplot(final_data, aes(x = resale_price)) +
geom_histogram(bins = 50, fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Resale Price",
x = "Resale Price",
y = "Frequency")
# Q-Q Plot
p3 <- ggplot(final_data, aes(sample = resale_price)) +
stat_qq() +
stat_qq_line() +
theme_minimal() +
labs(title = "Q-Q Plot of Resale Price")
# Boxplot
p4 <- ggplot(final_data, aes(y = resale_price)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of Resale Price",
y = "Resale Price")
# Display all plots
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
# 4. Print descriptive statistics
summary_stats <- summary(final_data$resale_price)
cat("\nSummary Statistics for Resale Price:\n")
##
## Summary Statistics for Resale Price:
print(summary_stats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 140000 365500 458888 492091 585000 1588000
# Check skewness and kurtosis
library(moments)
skew <- skewness(final_data$resale_price)
kurt <- kurtosis(final_data$resale_price)
cat("\nSkewness:", skew)
##
## Skewness: 1.007247
cat("\nKurtosis:", kurt)
##
## Kurtosis: 4.181941
We observed that the distribution exhibits a pronounced right-skewed (positive-skewed) pattern, with the majority of housing prices clustered in the low to medium price range:
The peak near SGD 400,000 indicates that the resale prices of most HDB flats in Singapore fall within the low to medium price range, aligning with the positioning of public housing (HDB).
A small number of high-priced properties elevate the overall average price:
The long tail extending beyond SGD 1.6 million suggests the presence of a few high-priced properties in the market (possibly those in prime locations, larger in size, or of special types), which have a significant impact on the overall average price.
Implications of the data distribution on analysis:
Due to the right-skewed distribution, using the raw price data directly may lead to inaccurate predictions for high-priced properties. Therefore, it is common practice to preprocess the price data (e.g., applying a logarithmic transformation) before modeling.
# 5. Check if Price Transformation is Needed. If the skewness is significant, a log transformation of the price may be necessary
if(abs(skew) > 1) {
# Add a plot of the distribution of log-transformed prices
p5 <- ggplot(final_data, aes(x = log(resale_price))) +
geom_histogram(bins = 50, fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Log-Transformed Resale Price",
x = "Log(Resale Price)",
y = "Frequency")
print(p5)
# Calculate the skewness after log transformation
log_skew <- skewness(log(final_data$resale_price))
cat("\nSkewness after log transformation:", log_skew)
}
##
## Skewness after log transformation: 0.1490842
Skewness after log transformation: 0.1499829
final_data$log_price <- log(final_data$resale_price)
model_data <- final_data %>%
select(log_price, year, floor_area_sqm, remaining_lease_years,
street_encoded, storey_encoded, distance_encoded)
set.seed(123)
train_index <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
evaluate_model <- function(pred, actual) {
# Convert back to the original price scale
pred_price <- exp(pred)
actual_price <- exp(actual)
# Calculate various indicators
rmse <- sqrt(mean((pred - actual)^2))
mae <- mean(abs(pred - actual))
r2 <- cor(pred, actual)^2
mape <- mean(abs((pred_price - actual_price)/actual_price)) * 100
return(c(RMSE = rmse, MAE = mae, R2 = r2, MAPE = mape))
}
lm_model <- lm(log_price ~ ., data = train_data)
lm_pred <- predict(lm_model, test_data)
lm_metrics <- evaluate_model(lm_pred, test_data$log_price)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_model <- randomForest(log_price ~ .,
data = train_data,
ntree = 100,
nodesize = 5,
maxnodes = 30,
importance = TRUE)
rf_pred <- predict(rf_model, test_data)
rf_metrics <- evaluate_model(rf_pred, test_data$log_price)
library(tidyverse)
library(randomForest)
library(xgboost)
# XGBoost Section Modifications
# Preparing Data for XGBoost Format
train_matrix <- as.matrix(train_data %>% select(-log_price))
test_matrix <- as.matrix(test_data %>% select(-log_price))
xgb_train <- xgb.DMatrix(train_matrix, label = train_data$log_price)
xgb_test <- xgb.DMatrix(test_matrix, label = test_data$log_price)
# Setting Parameters
params <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.1,
max_depth = 6,
min_child_weight = 1,
subsample = 0.8,
colsample_bytree = 0.8
)
# Model
xgb_model <- xgb.train(
params = params,
data = xgb_train,
nrounds = 100,
watchlist = list(train = xgb_train, test = xgb_test),
early_stopping_rounds = 10,
verbose = 0
)
# Prediction
xgb_pred <- predict(xgb_model, xgb_test)
xgb_metrics <- evaluate_model(xgb_pred, test_data$log_price)
results_df <- data.frame( Model = c(“Linear Regression”, “Random Forest”, “XGBoost”), RMSE = c(lm_metrics[“RMSE”], rf_metrics[“RMSE”], xgb_metrics[“RMSE”]), MAE = c(lm_metrics[“MAE”], rf_metrics[“MAE”], xgb_metrics[“MAE”]), R2 = c(lm_metrics[“R2”], rf_metrics[“R2”], xgb_metrics[“R2”]), MAPE = c(lm_metrics[“MAPE”], rf_metrics[“MAPE”], xgb_metrics[“MAPE”]) )
results_df <- data.frame(
Model = c("Linear Regression", "Random Forest", "XGBoost"),
RMSE = c(lm_metrics["RMSE"], rf_metrics["RMSE"], xgb_metrics["RMSE"]),
MAE = c(lm_metrics["MAE"], rf_metrics["MAE"], xgb_metrics["MAE"]),
R2 = c(lm_metrics["R2"], rf_metrics["R2"], xgb_metrics["R2"]),
MAPE = c(lm_metrics["MAPE"], rf_metrics["MAPE"], xgb_metrics["MAPE"])
)
print(results_df)
## Model RMSE MAE R2 MAPE
## 1 Linear Regression 0.1670835 0.13157984 0.7504260 13.176719
## 2 Random Forest 0.1648609 0.12816921 0.7924577 12.778603
## 3 XGBoost 0.1146862 0.08499183 0.8824758 8.452823
importance_rf <- importance(rf_model)
importance_rf_df <- data.frame(
Feature = rownames(importance_rf),
Importance = importance_rf[,"%IncMSE"]
) %>%
arrange(desc(Importance))
importance_xgb <- xgb.importance(feature_names = colnames(train_matrix),
model = xgb_model)
p1 <- ggplot(importance_rf_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Random Forest Feature Importance",
x = "Features",
y = "Importance (%IncMSE)")
p2 <- ggplot(data.frame(importance_xgb),
aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "XGBoost Feature Importance",
x = "Features",
y = "Gain")
gridExtra::grid.arrange(p1, p2, ncol = 2)
plot_data <- data.frame(
Actual = exp(test_data$log_price),
LM = exp(lm_pred),
RF = exp(rf_pred),
XGB = exp(xgb_pred)
)
p3 <- ggplot(plot_data, aes(x = Actual, y = LM)) +
geom_point(alpha = 0.5) +
geom_abline(color = "red") +
theme_minimal() +
labs(title = "Linear Regression: Predicted vs Actual")
p4 <- ggplot(plot_data, aes(x = Actual, y = RF)) +
geom_point(alpha = 0.5) +
geom_abline(color = "red") +
theme_minimal() +
labs(title = "Random Forest: Predicted vs Actual")
p5 <- ggplot(plot_data, aes(x = Actual, y = XGB)) +
geom_point(alpha = 0.5) +
geom_abline(color = "red") +
theme_minimal() +
labs(title = "XGBoost: Predicted vs Actual")
gridExtra::grid.arrange(p3, p4, p5, ncol = 2)
6.Conclusion
Based on the model comparison results, all three models demonstrate decent predictive capabilities, with XGBoost showing the best overall performance as evidenced by its predictions clustering most tightly around the diagonal line in the Predicted vs Actual plot. The Linear Regression model exhibits more scattered predictions, particularly in the higher price range, indicating potential limitations in capturing non-linear relationships. The Random Forest model performs better than linear regression but still shows some deviation in high-price properties. Feature importance analysis from both Random Forest and XGBoost consistently identifies floor area as the most significant factor, followed by street location and transaction year, while distance from expressways shows minimal impact. This consistency across different models strengthens the reliability of our feature importance findings.
Considering the research objective of understanding HDB resale price determinants and predicting prices, our analysis provides valuable insights for various stakeholders. The predominant influence of floor area and street location on housing prices suggests that physical attributes and geographical positioning remain fundamental in price determination, while the relatively minor impact of expressway proximity indicates that this factor might be less crucial in housing decisions than initially hypothesized. The significant importance of the transaction year reflects the dynamic nature of Singapore’s property market, suggesting that timing plays a crucial role in price determination. These findings can guide homebuyers in prioritizing factors during their property search, assist investors in understanding value drivers, and help policymakers assess the impact of infrastructure developments on housing values. The robust performance of the XGBoost model, in particular, provides a reliable tool for price prediction, though the consistent feature importance rankings across models strengthen the validity of our conclusions about price determinants.
This R Markdown document demonstrates the process of predicting whether a housing unit is located in a high-price area using classification models. The workflow includes data exploration, preprocessing, modeling (Decision Tree and Random Forest), and model evaluation. The analysis also provides detailed insights into the data and the performance of the models used.
# Load necessary libraries
library(tidyverse)
library(caret)
# Read the data
file_path <- "~/Desktop/um/WQD7008/hdb_presale_prices_2015-2024_cleaned_regression.csv"
data <- read.csv(file_path)
# Check data structure
str(data)
## 'data.frame': 220971 obs. of 11 variables:
## $ year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ town : chr "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
## $ flat_type : chr "2 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
## $ block : chr "406" "108" "602" "465" ...
## $ street_name : chr "ANG MO KIO AVENUE 10" "ANG MO KIO AVENUE 4" "ANG MO KIO AVENUE 5" "ANG MO KIO AVENUE 10" ...
## $ storey_range : chr "10 TO 12" "01 TO 03" "01 TO 03" "04 TO 06" ...
## $ floor_area_sqm : num 44 67 67 68 67 68 68 67 68 67 ...
## $ remaining_lease_years : int 61 60 62 62 62 63 61 58 61 61 ...
## $ resale_price : num 232000 250000 262000 265000 265000 275000 280000 285000 285000 285000 ...
## $ storey_range_category : chr "Low-Mid (07-12)" "Low (01-06)" "Low (01-06)" "Low (01-06)" ...
## $ distance_from_expressway: chr ">500m" ">500m" ">500m" "151-300m" ...
summary(data)
## year town flat_type block
## Min. :2015 Length:220971 Length:220971 Length:220971
## 1st Qu.:2017 Class :character Class :character Class :character
## Median :2020 Mode :character Mode :character Mode :character
## Mean :2020
## 3rd Qu.:2022
## Max. :2024
## street_name storey_range floor_area_sqm remaining_lease_years
## Length:220971 Length:220971 Min. : 31.00 Min. :41.00
## Class :character Class :character 1st Qu.: 82.00 1st Qu.:63.00
## Mode :character Mode :character Median : 94.00 Median :74.00
## Mean : 97.18 Mean :74.49
## 3rd Qu.:112.00 3rd Qu.:86.00
## Max. :366.70 Max. :98.00
## resale_price storey_range_category distance_from_expressway
## Min. : 140000 Length:220971 Length:220971
## 1st Qu.: 365000 Class :character Class :character
## Median : 458000 Mode :character Mode :character
## Mean : 490867
## 3rd Qu.: 585000
## Max. :1588000
# Assume 'resale_price' column is the price column, set the high-price threshold
price_threshold <- 700000
data$is_high_price <- ifelse(data$resale_price > price_threshold, "Yes", "No")
# Convert the target variable to a factor
data$is_high_price <- as.factor(data$is_high_price)
# Check for missing values
colSums(is.na(data))
## year town flat_type
## 0 0 0
## block street_name storey_range
## 0 0 0
## floor_area_sqm remaining_lease_years resale_price
## 0 0 0
## storey_range_category distance_from_expressway is_high_price
## 0 0 0
# Remove columns with many missing values or fill in missing values
data <- data %>% drop_na()
# Select relevant features
selected_features <- c("floor_area_sqm", "remaining_lease_years", "town", "flat_type", "is_high_price")
data <- data[, selected_features]
# Check final data
summary(data)
## floor_area_sqm remaining_lease_years town flat_type
## Min. : 31.00 Min. :41.00 Length:220971 Length:220971
## 1st Qu.: 82.00 1st Qu.:63.00 Class :character Class :character
## Median : 94.00 Median :74.00 Mode :character Mode :character
## Mean : 97.18 Mean :74.49
## 3rd Qu.:112.00 3rd Qu.:86.00
## Max. :366.70 Max. :98.00
## is_high_price
## No :194759
## Yes: 26212
##
##
##
##
The dataset consists of 220,971 observations and 11 features, providing a comprehensive view of HDB resale transactions from 2015 to 2024.
Key features include:
resale_price: The transaction price of the unit.floor_area_sqm: The size of the unit in square
meters.remaining_lease_years: The number of years remaining on
the lease.Resale prices range from $140,000 to $1,588,000, while floor areas vary from 31 to 366.7 square meters.
A new target variable, is_high_price, has been
created to classify units as “High-price” (Yes) or “Non-high-price” (No)
based on a threshold of $700,000.
The dataset shows that 26,212 units (11.86%) are classified as high-price, while 194,759 units (88.14%) fall into the non-high-price category.
No missing values were detected in the dataset, ensuring data quality for modeling.
ggplot(data, aes(x = remaining_lease_years)) +
geom_histogram(bins = 50, fill = "steelblue", color = "black") +
labs(title = "Distribution of Resale Prices", x = "Resale Price", y = "Frequency") +
theme_minimal()
ggplot(data, aes(x = floor_area_sqm)) +
geom_histogram(bins = 30, fill = "darkgreen", color = "black") +
labs(title = "Distribution of Floor Area (sqm)", x = "Floor Area (sqm)", y = "Frequency") +
theme_minimal()
ggplot(data, aes(x = flat_type, y = remaining_lease_years)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Resale Price by Flat Type", x = "Flat Type", y = "Resale Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
avg_price_by_town <- data %>%
group_by(town) %>%
summarise(avg_resale_price = mean(remaining_lease_years, na.rm = TRUE)) %>%
arrange(desc(avg_resale_price))
ggplot(avg_price_by_town, aes(x = reorder(town, -avg_resale_price), y = avg_resale_price)) +
geom_bar(stat = "identity", fill = "purple") +
labs(title = "Average Resale Price by Town", x = "Town", y = "Average Resale Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
- The correlation matrix reveals:
- A strong positive correlation between resale price and floor area.
- A weaker correlation between resale price and remaining lease years.
## Modeling and Evaluation
### Split the Dataset
``` r
# Set random seed for reproducibility
set.seed(123)
# Split the dataset (80% training set, 20% testing set)
trainIndex <- createDataPartition(data$is_high_price, p = 0.8, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
# Load libraries
library(rpart)
library(rpart.plot)
# Train decision tree model
tree_model <- rpart(is_high_price ~ ., data = train_data, method = "class")
rpart.plot(tree_model)
# Predict on test data
tree_preds <- predict(tree_model, test_data, type = "class")
# Evaluate the model
confusionMatrix(tree_preds, test_data$is_high_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 37918 2092
## Yes 1033 3150
##
## Accuracy : 0.9293
## 95% CI : (0.9269, 0.9317)
## No Information Rate : 0.8814
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6294
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9735
## Specificity : 0.6009
## Pos Pred Value : 0.9477
## Neg Pred Value : 0.7530
## Prevalence : 0.8814
## Detection Rate : 0.8580
## Detection Prevalence : 0.9053
## Balanced Accuracy : 0.7872
##
## 'Positive' Class : No
##
Model Diagram: The decision tree starts with a
split on the resale_price threshold and sequentially splits
on other features such as floor_area_sqm,
remaining_lease_years, and flat_type. This
visualization aids in understanding the decision-making process of the
model.
Performance:
library(randomForest)
rf_model <- randomForest(is_high_price ~ ., data = train_data, ntree = 100)
rf_preds <- predict(rf_model, test_data)
confusionMatrix(rf_preds, test_data$is_high_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 38206 1263
## Yes 745 3979
##
## Accuracy : 0.9546
## 95% CI : (0.9526, 0.9565)
## No Information Rate : 0.8814
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.773
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9809
## Specificity : 0.7591
## Pos Pred Value : 0.9680
## Neg Pred Value : 0.8423
## Prevalence : 0.8814
## Detection Rate : 0.8645
## Detection Prevalence : 0.8931
## Balanced Accuracy : 0.8700
##
## 'Positive' Class : No
##
The Random Forest model is built as an ensemble of 100 decision trees, leveraging their combined outputs for improved prediction.
Performance:
The ensemble nature of the Random Forest model ensures robustness, reducing the likelihood of overfitting compared to a single decision tree.
tune_grid <- expand.grid(mtry = c(2, 3))
ctrl <- trainControl(method = "cv", number = 3, verboseIter = TRUE)
set.seed(123)
sampled_train_data <- train_data[sample(1:nrow(train_data), 5000), ]
rf_tuned_model <- train(is_high_price ~ ., data = sampled_train_data, method = "rf",
tuneGrid = tune_grid, trControl = ctrl, ntree = 100)
## + Fold1: mtry=2
## - Fold1: mtry=2
## + Fold1: mtry=3
## - Fold1: mtry=3
## + Fold2: mtry=2
## - Fold2: mtry=2
## + Fold2: mtry=3
## - Fold2: mtry=3
## + Fold3: mtry=2
## - Fold3: mtry=2
## + Fold3: mtry=3
## - Fold3: mtry=3
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 3 on full training set
rf_tuned_model
## Random Forest
##
## 5000 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 3333, 3334, 3333
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8822000 0.0000000
## 3 0.9122003 0.4012192
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
mtry (2 and 3),
representing the number of predictors randomly sampled at each
split.mtry = 3, achieving a Kappa
value of 0.79 and maintaining high accuracy
(99% in cross-validation).In alignment with our second objective, this study successfully developed models to classify whether a housing unit is located in a high-price area based on features such as floor area, remaining lease years, and flat type. Among the models, the Random Forest classifier showed the best performance, achieving an accuracy of 95.51% and a specificity of 76.84%, demonstrating its ability to accurately identify high-price units despite the class imbalance in the dataset.
Feature importance analysis highlighted that floor area and remaining lease years are the most significant factors influencing price classification, while proximity to expressways had a minimal role. These findings confirm the importance of physical attributes and lease duration in distinguishing high-value units, directly addressing our goal of understanding price segmentation.
By achieving strong classification performance and uncovering key price determinants, this analysis provides practical tools and insights for stakeholders, such as guiding buyers and investors in targeting high-value areas and supporting policymakers in assessing housing affordability trends.
Key Takeaways:
Future Improvements: