The housing market is a critical component of any economy, serving as a key indicator of financial stability and urban development (Clerc, 2019). Predicting house prices is a valuable exercise that not only aids in understanding market dynamics but also provides actionable insights for various stakeholders, including real estate developers, investors, policymakers, and urban planners (Mohamad et al., 2024). House prices are influenced by a combination of structural, locational, and economic factors. Among these, variables such as house age, distance to transportation infrastructure (e.g., the nearest MRT station), the availability of convenience stores, and geographical attributes like latitude and longitude play a pivotal role in determining property values (Cellmer and Kobylińska, 2024).
This study seeks to predict house prices using these influential factors, leveraging advanced analytical methods to uncover the relationships between these variables and property valuation. The findings from this research are vital for several reasons. For businesses in the real estate and construction sectors, understanding these relationships enables more accurate pricing strategies and better targeting of potential buyers. Investors and developers can use these insights to identify high-value opportunities, minimize risks, and optimize their portfolio strategies.
For policymakers and urban planners, the study highlights the impact of public infrastructure, such as MRT stations, and the spatial distribution of amenities on property values. Such information is crucial for designing equitable urban development policies, optimizing land use, and improving overall city planning. Additionally, homebuyers can benefit from a clearer understanding of how these factors influence house prices, enabling them to make informed purchasing decisions.
By focusing on predictive modeling with relevant and measurable variables, this study contributes to the growing field of data-driven real estate analytics, offering robust methodologies to address complex market dynamics. The results hold the potential to enhance decision-making, promote sustainable urban growth, and support the strategic objectives of businesses and stakeholders in the housing sector.
The broad aim of this study is to identify the variables that best predict house prices and to determine the most effective predictive model among ensemble and non-ensemble regression machine learning models. To achieve this aim, the study is guided by the following specific objectives:
1. To examine the relationship between house prices and individual predictor variables: analyze how each factor (house age, distance to the nearest MRT station, number of convenience stores, latitude, and longitude) independently affects house prices, with the goal of understanding the strength and direction of these relationships.
2. To investigate the combined effect of multiple variables on house prices: explore the interactions between the predictor variables to evaluate how these factors collectively influence property valuation.
3. To preprocess and prepare the data for predictive modeling: normalize the numeric variables, split the data into training and testing sets, and visualize patterns within the data to ensure readiness for model training and evaluation.
4. To implement and evaluate non-ensemble regression machine learning models: apply linear regression, ridge regression, and lasso regression to predict house prices, assessing their performance with metrics such as root mean square error (RMSE) and mean absolute error (MAE).
5. To implement and evaluate ensemble regression machine learning models: apply random forest, tuned random forest, and XGBoost to predict house prices, evaluating them with the same metrics (RMSE and MAE) to compare their predictive power.
6. To compare the performance of ensemble and non-ensemble models: systematically analyze and compare both types of models to determine which approach yields the most accurate predictions of house prices.
7. To identify the best predictive model for house price estimation: among all the evaluated models, pinpoint the one that consistently delivers the highest accuracy and reliability in predicting house prices.
8. To provide actionable insights for stakeholders based on model findings: derive recommendations for real estate developers, investors, policymakers, and urban planners by interpreting the results and highlighting the most influential variables in house price prediction.
The data cleaning process involved dropping the serial number column, as it was not relevant for analysis, checking for missing values (none were found), and renaming column headers for convenience and readability. These steps ensured that the dataset was well-structured and prepared for further analysis.
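These cleaning steps can be summarized in a few lines of R. The sketch below is illustrative only: the file name and the renamed headers are assumptions inferred from the variable names used later in the analysis.
# Minimal sketch of the cleaning steps (file name "real_estate.csv" is hypothetical)
data <- read.csv("real_estate.csv")
data <- data[, -1]   # drop the serial-number column, which is not relevant for analysis
sum(is.na(data))     # check for missing values (none were found)
names(data) <- c("transaction_date", "house_age", "distance_to_MRT",
                 "num_convenience_stores", "latitude", "longitude", "house_price")
str(data)            # confirm the dataset is well structured for further analysis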
Individual graphs, including histograms and box plots, were plotted to visualize the distributions of key variables. Skewness and kurtosis values revealed varying degrees of asymmetry and peakedness in the data. For instance, house age had low skewness (0.38) and kurtosis (2.12), indicating a relatively symmetric distribution, while the distance to the nearest MRT station exhibited high skewness (1.88) and kurtosis (6.15), suggesting a heavy tail and a significant deviation from normality. This suggests that non-linear models may predict house prices better than linear regression.
# skewness() and kurtosis() are assumed to come from the moments package
ageskew <- skewness(data$house_age)
agekut <- kurtosis(data$house_age)
cat("For house age, skewness is", ageskew, "and kurtosis is", agekut, "\n")
## For house age, skewness is 0.3815374 and kurtosis is 2.11898
distskew <- skewness(data$distance_to_MRT)
distkut <- kurtosis(data$distance_to_MRT)
cat("For distance to MRT, skewness is", distskew, "and kurtosis is", distkut, "\n")
## For distance to MRT, skewness is 1.881906 and kurtosis is 6.154799
storeskew <- skewness(data$num_convenience_stores)
storekut <- kurtosis(data$num_convenience_stores)
cat("For number of convenience stores, skewness is", storeskew, "and kurtosis is", storekut, "\n")
## For number of convenience stores, skewness is 0.1540458 and kurtosis is 1.932619
priceskew <- skewness(data$house_price)
pricekut <- kurtosis(data$house_price)
cat("For house price, skewness is", priceskew, "and kurtosis is", pricekut, "\n")
## For house price, skewness is 0.597677 and kurtosis is 5.13841
The dataset was split into training and testing sets in an 80:20 ratio to ensure sufficient data for both model training and evaluation. The training set was used to build the predictive models, while the testing set assessed their generalization performance.
# Normalize numeric variables
numeric_cols <- c("house_age", "distance_to_MRT", "num_convenience_stores", "latitude", "longitude", "house_price")
data[numeric_cols] <- scale(data[numeric_cols])
# Split data into training and testing sets (80:20 split)
set.seed(123)
trainIndex <- createDataPartition(data$house_price, p = 0.8, list = FALSE)   # createDataPartition() is from the caret package
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
Pairwise scatterplots of the training and testing data showed similar distributions and relationships among the variables, indicating that the split preserved the underlying patterns and variability in the dataset. This consistency ensures that models trained on the training set can effectively generalize to unseen data.
# Visualize training and testing sets
pairs(train[, numeric_cols], main = "Train Data Pairwise Plot")
pairs(test[, numeric_cols], main = "Test Data Pairwise Plot")
# Visualize distributions in train and test (ggplot2 for the density plots; grid.arrange() below is from gridExtra)
train_test_distribution <- lapply(names(data)[-1], function(var) {
ggplot() +
geom_density(data = train, aes(x = .data[[var]]), fill = "blue", alpha = 0.3) +
geom_density(data = test, aes(x = .data[[var]]), fill = "red", alpha = 0.3) +
labs(title = paste("Train vs Test Distribution -", var), x = var, y = "Density") +
theme_minimal()
})
grid.arrange(grobs = train_test_distribution, ncol = 2)
Non-ensemble regression models, such as linear regression, ridge regression, and lasso regression, are individual predictive algorithms that rely on a single hypothesis to model relationships between independent variables and a target variable (Mathotaarachchi et al., 2024). Linear regression assumes a linear relationship, while ridge regression adds a penalty to reduce multicollinearity, and lasso regression performs variable selection by shrinking coefficients of less important features to zero (Li, 2023). These models are straightforward to implement, interpretable, and computationally efficient, making them ideal for understanding the direct relationships between predictors and outcomes (Deng, 2024).
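For reference, the standard penalized least-squares objectives behind these two methods (stated here in their textbook form, not reproduced from the analysis code) are

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where larger values of the tuning parameter \(\lambda\) shrink the coefficients more strongly; the absolute-value penalty used by the lasso is what allows some coefficients to be set exactly to zero.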
The regression model indicates that the most important features influencing house price, based on statistical significance (p-values), are house age, distance to MRT, number of convenience stores, latitude, and transaction date, all of which have p-values below 0.01. Among these, house age and distance to MRT negatively affect house prices, while latitude, transaction date, and number of convenience stores contribute positively. The longitude variable is not statistically significant (p = 0.55) and likely has minimal influence in the model. The test-set RMSE (root mean squared error) of 0.5744 measures the model's average prediction error; since house price was standardized, predictions deviate from the actual prices by roughly 0.57 standard deviations on average.
lm_model <- lm(house_price ~ ., data = train)
lm_pred <- predict(lm_model, test)
lm_rmse <- sqrt(mean((lm_pred - test$house_price)^2))
summary(lm_model) #explore the model details
##
## Call:
## lm(formula = house_price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6328 -0.3948 -0.0567 0.3036 5.4837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -848.98910 265.33108 -3.200 0.00151 **
## transaction_date 0.42173 0.13180 3.200 0.00151 **
## house_age -0.22402 0.03748 -5.977 5.96e-09 ***
## distance_to_MRT -0.43926 0.07401 -5.935 7.53e-09 ***
## num_convenience_stores 0.24558 0.04627 5.307 2.06e-07 ***
## latitude 0.22288 0.04523 4.928 1.32e-06 ***
## longitude -0.03689 0.06192 -0.596 0.55174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6698 on 325 degrees of freedom
## Multiple R-squared: 0.5796, Adjusted R-squared: 0.5718
## F-statistic: 74.68 on 6 and 325 DF, p-value: < 2.2e-16
cat("RMSE for the linear model is", lm_rmse)
## RMSE for the linear model is 0.5744137
The RMSE (Root Mean Square Error) of 0.568282 indicates the average difference between the predicted and actual values in the same units as the dependent variable, with lower values representing better model performance. Among the features, transaction_date (0.3886), num_convenience_stores (0.2448), and latitude (0.2254) contribute positively to the predicted outcome, while house_age (-0.2093) and distance_to_MRT (-0.3716) have a negative influence. The intercept (-782.26) indicates the baseline prediction when all feature values are zero. Overall, the positive and negative coefficients reflect the direction and relative importance of these features in the model.
ridge_model <- glmnet(as.matrix(train[, -7]), train$house_price, alpha = 0)
ridge_pred <- predict(ridge_model, s = 0.01, newx = as.matrix(test[, -7]))
ridge_rmse <- sqrt(mean((ridge_pred - test$house_price)^2))
cat("RMSE for the Ridge Regression model is", ridge_rmse, "\n")
## RMSE for the Ridge Regression model is 0.568282
summary(ridge_rmse) # note: this summarizes the single RMSE value rather than the fitted model
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5683 0.5683 0.5683 0.5683 0.5683 0.5683
# Extract coefficients for the selected lambda (s = 0.01)
ridge_coefficients <- coef(ridge_model, s = 0.01)
# Display the coefficients
cat("Coefficients for Ridge Regression model:\n")
## Coefficients for Ridge Regression model:
print(ridge_coefficients)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -782.25555151
## transaction_date 0.38857731
## house_age -0.20931406
## distance_to_MRT -0.37162779
## num_convenience_stores 0.24479678
## latitude 0.22537883
## longitude 0.01604567
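The penalty value s = 0.01 used above is a fixed choice. A common alternative, shown here only as a hedged sketch rather than part of the original analysis, is to let cross-validation choose lambda with glmnet's cv.glmnet():
# Sketch: choose the ridge penalty by 10-fold cross-validation instead of fixing s = 0.01
set.seed(123)
ridge_cv <- cv.glmnet(as.matrix(train[, -7]), train$house_price, alpha = 0)
best_lambda <- ridge_cv$lambda.min   # lambda with the lowest cross-validated error
ridge_pred_cv <- predict(ridge_cv, s = best_lambda, newx = as.matrix(test[, -7]))
sqrt(mean((ridge_pred_cv - test$house_price)^2))   # RMSE under the CV-chosen lambda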
The lasso regression model identifies the most important features as transaction_date, house_age, distance_to_MRT, num_convenience_stores, and latitude, as these retain nonzero coefficients. Among them, distance_to_MRT has the largest negative impact on the target variable, followed by house_age, while transaction_date and num_convenience_stores positively influence the target. The RMSE (root mean squared error) of 0.5696548 indicates the average magnitude of the model's prediction errors, with smaller values reflecting better predictive performance. The fact that the coefficient for longitude is shrunk exactly to zero suggests it is not an impactful predictor.
lasso_model <- glmnet(as.matrix(train[, -7]), train$house_price, alpha = 1)
lasso_pred <- predict(lasso_model, s = 0.01, newx = as.matrix(test[, -7]))
lasso_rmse <- sqrt(mean((lasso_pred - test$house_price)^2))
cat("RMSE for the Lasso Regression model is", lasso_rmse, "\n")
## RMSE for the Lasso Regression model is 0.5696548
# Extract coefficients for the selected lambda (s = 0.01)
lasso_coefficients <- coef(lasso_model, s = 0.01)
# Display the coefficients
cat("Coefficients for Lasso Regression model:\n")
## Coefficients for Lasso Regression model:
print(lasso_coefficients)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -772.6876354
## transaction_date 0.3838245
## house_age -0.2126234
## distance_to_MRT -0.4036062
## num_convenience_stores 0.2415918
## latitude 0.2204268
## longitude .
# Prepare the design matrix and response variable
X <- as.matrix(train[, -7]) # Exclude the target variable
y <- train$house_price
# Fit the Lasso model
lasso_fit <- glmnet(X, y, alpha = 1)
# Extract the coefficients for the selected lambda
lambda <- 0.01
beta <- as.numeric(coef(lasso_fit, s = lambda)[-1]) # Exclude intercept
# Perform selective inference (fixedLassoInf() is from the selectiveInference package)
lasso_inference <- fixedLassoInf(X, y, beta = beta, lambda = lambda)
# Display results
print(lasso_inference)
##
## Call:
## fixedLassoInf(x = X, y = y, beta = beta, lambda = lambda)
##
## Standard deviation of noise (specified or estimated) sigma = 0.670
##
## Testing results at lambda = 0.010, with alpha = 0.100
##
## Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
## 1 0.418 3.177 0.001 0.195 0.635 0.048 0.050
## 2 -0.223 -5.957 0.000 -0.285 -0.161 0.048 0.048
## 3 -0.407 -7.982 0.000 -0.492 -0.323 0.048 0.050
## 4 0.247 5.348 0.000 0.170 0.324 0.048 0.048
## 5 0.226 5.027 0.000 0.152 0.300 0.050 0.049
##
## Note: coefficients shown are partial regression coefficients
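The rows of the inference table are numbered rather than named. Assuming they follow the column order of the design matrix X, the active (nonzero-coefficient) predictors can be mapped to those row numbers with a short convenience sketch:
# Sketch: label the numbered rows of the selective-inference output with predictor names
active_vars <- colnames(X)[which(beta != 0)]   # predictors retained by the lasso at lambda = 0.01
data.frame(Row = seq_along(active_vars), Predictor = active_vars)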
The ridge regression model has the best performance among the three non-ensemble models, as it has the lowest RMSE (0.5683). A lower RMSE indicates that the model's predictions are closer to the actual values on average, making it the most accurate of the three regression models in this case.
Ensemble regression models combine multiple algorithms to improve prediction accuracy and model robustness (Renju and Freni, 2024). In this study, random forest, tuned random forest, and XGBoost were used. Random forest builds numerous decision trees and averages their predictions to reduce overfitting and enhance generalization (Boyapati et al., 2023). A tuned random forest optimizes hyperparameters, such as the number of features considered at each split, for better performance (Hoxha, 2024). XGBoost, an advanced gradient boosting technique, sequentially builds trees to minimize error, offering superior speed and efficiency (Dobrovolska and Fenenko, 2024). These models are powerful for capturing complex interactions and non-linear relationships in data.
Variable importance provides two measures: %IncMSE, the increase in mean squared error when the variable is permuted, and IncNodePurity, the total decrease in node impurity contributed by each variable across all trees (Boyapati et al., 2023).
The random forest model achieved an RMSE (root mean squared error) of 0.4231, which indicates the average difference between the predicted and actual values on the same (standardized) scale as the target variable, with lower values representing better model performance. Regarding feature importance, the most influential variables by %IncMSE (predictive power) and IncNodePurity (reduction in variance) are latitude and distance_to_MRT, followed by house_age and longitude, suggesting that location and proximity to transport significantly affect the target variable. num_convenience_stores is moderately important, while transaction_date has the least impact on the model's predictions.
# Fit the Random Forest model
rf_model <- randomForest(house_price ~ ., data = train, importance = TRUE)
# Predict on test data
rf_pred <- predict(rf_model, test)
# Calculate RMSE
rf_rmse <- sqrt(mean((rf_pred - test$house_price)^2))
cat("RMSE for Random Forest model:", rf_rmse, "\n")
## RMSE for Random Forest model: 0.4230619
# Extract Variable Importance
importance_scores <- importance(rf_model)
print("Variable Importance:")
## [1] "Variable Importance:"
print(importance_scores)
## %IncMSE IncNodePurity
## transaction_date 4.616076 13.28709
## house_age 21.885280 47.44996
## distance_to_MRT 25.572889 107.44111
## num_convenience_stores 18.642815 36.66097
## latitude 29.125910 72.37410
## longitude 18.906679 52.84746
# Plot Variable Importance
varImpPlot(rf_model, main = "Variable Importance Plot")
The tuned random forest regression model has an RMSE (root mean square error) of 0.4196, indicating the average error between the predicted and actual values on the same (standardized) scale as the target variable. Among the features, distance_to_MRT and latitude (the highest X.IncMSE and IncNodePurity values) stand out as the most important predictors, suggesting they have the greatest impact on the model's accuracy and its ability to split the data effectively. Other influential features include house_age and longitude, while transaction_date contributes the least. This implies the model relies heavily on geographic and proximity-related variables for its predictions.
# Fit the "tuned" random forest model (ntree and mtry specified manually)
tuned_rf_model <- randomForest(house_price ~ ., data = train, ntree = 500, mtry = 3, importance = TRUE)
# Predict on test data
tuned_rf_pred <- predict(tuned_rf_model, test)
# Calculate RMSE
tuned_rf_rmse <- sqrt(mean((tuned_rf_pred - test$house_price)^2))
cat("Tuned Random Forest RMSE:", tuned_rf_rmse, "\n")
## Tuned Random Forest RMSE: 0.4196182
# Get feature importance
importance_rf <- importance(tuned_rf_model)
importance_df <- data.frame(Variable = rownames(importance_rf), importance_rf)
print(importance_df)
## Variable X.IncMSE IncNodePurity
## transaction_date transaction_date 2.921752 13.22966
## house_age house_age 22.318900 47.33545
## distance_to_MRT distance_to_MRT 30.400356 130.84954
## num_convenience_stores num_convenience_stores 15.279664 26.20740
## latitude latitude 28.948769 74.67247
## longitude longitude 17.034276 42.89673
# Plot feature importance
ggplot(importance_df, aes(x = reorder(Variable, -IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance (Random Forest)", x = "Variable", y = "Increase in Node Purity")
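The model above fixes ntree = 500 and mtry = 3 by hand. A more data-driven tuning step, sketched here under the assumption that the same train data frame is available (it is not part of the original analysis), would search over mtry with randomForest::tuneRF():
# Sketch: select mtry by out-of-bag (OOB) error instead of fixing it at 3
set.seed(123)
mtry_search <- tuneRF(
  x = train[, setdiff(names(train), "house_price")],
  y = train$house_price,
  ntreeTry = 500,     # trees grown for each candidate mtry
  stepFactor = 1.5,   # multiplicative step between candidate mtry values
  improve = 0.01,     # minimum relative OOB improvement required to keep searching
  trace = FALSE, plot = FALSE
)
best_mtry <- mtry_search[which.min(mtry_search[, "OOBError"]), "mtry"]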
The RMSE (root mean square error) of 0.406949 indicates the average prediction error of the XGBoost regression model, with smaller values representing better accuracy. Regarding feature importance, distance_to_MRT is the most influential predictor, as it has the highest gain (0.597), indicating its strong contribution to reducing model error. house_age is also significant, with a high cover (0.360), meaning it affects a larger proportion of data splits. Other features such as latitude and transaction_date are moderately important, while longitude and num_convenience_stores are less influential based on their lower gain and frequency values.
# Create XGBoost DMatrix objects
xgb_train <- xgb.DMatrix(data = as.matrix(train[, -7]), label = train$house_price)
xgb_test <- xgb.DMatrix(data = as.matrix(test[, -7]), label = test$house_price)
# Train the XGBoost model
xgb_model <- xgboost(
data = xgb_train,
nrounds = 100,
objective = "reg:squarederror",
verbose = 0
)
# Make predictions and calculate RMSE
xgb_pred <- predict(xgb_model, xgb_test)
xgb_rmse <- sqrt(mean((xgb_pred - test$house_price)^2))
cat("RMSE for XGBoost model:", xgb_rmse, "\n")
## RMSE for XGBoost model: 0.406949
# Retrieve feature importance
importance <- xgb.importance(model = xgb_model)
print(importance)
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: distance_to_MRT 0.59746952 0.13448653 0.14554295
## 2: house_age 0.18782196 0.36049081 0.30664506
## 3: latitude 0.09961284 0.20822550 0.13225284
## 4: transaction_date 0.04833544 0.13244576 0.28395462
## 5: longitude 0.04435060 0.11783811 0.08427877
## 6: num_convenience_stores 0.02240964 0.04651329 0.04732577
# Visualize feature importance
xgb.plot.importance(importance_matrix = importance, top_n = 10)
The XGBoost model performed the best among the ensemble models because it has the lowest RMSE of 0.4069, indicating the smallest average prediction error. The tuned random forest model follows with an RMSE of 0.4196, and the standard random forest model has the highest RMSE of 0.4231, making it the least accurate of the three.
Among all the models, XGBoost has the best performance with the lowest RMSE of 0.4069, indicating the highest accuracy and smallest average prediction error. The next best-performing models are the tuned random forest (RMSE: 0.4196) and the random forest (RMSE: 0.4231). The linear, ridge, and lasso regression models perform considerably worse, with RMSE values above 0.56, showing that they are less effective for this task than the ensemble methods.
# Compare Model Performance
results <- data.frame(
Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Random Forest", "Tuned Random Forest", "XGBoost"),
RMSE = c(lm_rmse, ridge_rmse, lasso_rmse, rf_rmse, tuned_rf_rmse, xgb_rmse)
)
print(results)
## Model RMSE
## 1 Linear Regression 0.5744137
## 2 Ridge Regression 0.5682820
## 3 Lasso Regression 0.5696548
## 4 Random Forest 0.4230619
## 5 Tuned Random Forest 0.4196182
## 6 XGBoost 0.4069490
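The objectives also list mean absolute error (MAE) as an evaluation metric. It is not reported above, but it could be added to the same comparison table with a few lines; the sketch below simply reuses the prediction vectors already computed.
# Sketch: add MAE alongside RMSE, reusing the predictions computed above
mae <- function(pred, actual) mean(abs(pred - actual))
results$MAE <- c(
  mae(lm_pred, test$house_price),
  mae(ridge_pred, test$house_price),
  mae(lasso_pred, test$house_price),
  mae(rf_pred, test$house_price),
  mae(tuned_rf_pred, test$house_price),
  mae(xgb_pred, test$house_price)
)
print(results)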
This study evaluated the performance of various regression models in predicting house prices, focusing on both individual and ensemble approaches. Among the models tested, ensemble models consistently outperformed individual regression models in predictive accuracy, with XGBoost achieving the best performance. The XGBoost model had the lowest RMSE (0.4069), followed by the tuned random forest model (0.4196). These results reinforce the superior capability of ensemble methods in capturing complex relationships and patterns in the data.
In terms of feature importance, distance to MRT emerged as one of the most significant predictors across the models, and in the XGBoost model it had the highest gain value (0.597). This finding highlights the critical role of proximity to transportation in determining house prices. Other important predictors included house age (negatively correlated with house prices) and latitude, which also contributed meaningfully to the models' predictions. While features such as transaction date and number of convenience stores positively influenced house prices, their relative impact was less pronounced in the ensemble models than in linear regression. Conversely, longitude was consistently found to have minimal influence, reflecting its limited predictive value for this dataset.
The results of this study emphasize the utility of advanced ensemble methods like XGBoost for house price prediction, as they combine predictive accuracy with insights into feature importance. Future research could explore further fine-tuning of hyperparameters and the inclusion of additional spatial or economic features to enhance model performance and applicability.
Boyapati, S. V., Karthik, M., Subrahmanyam, K., & Reddy, B. (2023). An Analysis of House Price Prediction Using Ensemble Learning Algorithms. Research Reports on Computer Science, 87-96. doi:10.37256/rrcs.2320232639
Cellmer, R., & Kobylińska, K. (2024). Housing Price Prediction - Machine Learning and Geostatistical Methods. Real Estate Management and Valuation. doi:10.2478/remav-2025-0001
Clerc, L. (2019). Towards a Global Real Estate Market? Trends and Evidence. doi:10.1007/978-3-030-11674-3_6
Deng, Z. (2024). Comparative Analysis of the Effectiveness of Different Algorithms for House Price Prediction in Machine Learning. Advances in Economics, Management and Political Sciences, 136, 101-107. doi:10.54254/2754-1169/2024.18705
Dobrovolska, O., & Fenenko, N. (2024). Forecasting Trends in the Real Estate Market: Analysis of Relevant Determinants. Financial Markets, Institutions and Risks, 8(3), 227-253. doi:10.61093/fmir.8(3).227-253.2024
Hoxha, V. (2024). Comparative analysis of machine learning models in predicting housing prices: a case study of Prishtina's real estate market. International Journal of Housing Markets and Analysis, ahead-of-print. doi:10.1108/IJHMA-09-2023-0120
Li, Y. (2023). Analysis of Real Estate Predictions Based on Different Models. Highlights in Science, Engineering and Technology, 76, 410-414. doi:10.54097/vbmqmh04
Mathotaarachchi, K. V., Hasan, R., & Mahmood, S. (2024). Advanced Machine Learning Techniques for Predictive Modeling of Property Prices. Information, 15(6), 295. doi:10.3390/info15060295
Mohamad, A. H. H., Ab-Rahim, R., Rahman, A., & Mohamed Esa, M. (2024). Housing Market: A Bibliometric Analysis from 2020 to 2024. International Journal of Academic Research in Business and Social Sciences, 14(8), 274-295. doi:10.6007/IJARBSS/v14-i8/22390
Renju, K., & Freni, S. (2024). Ensemble Approach for Predicting the Price of Residential Property. International Journal of Information Technology, Research and Applications, 3(2), 27-38. doi:10.59461/ijitra.v3i2.99
Yeh, I. (2018). Real Estate Valuation [Dataset]. UCI Machine Learning Repository. doi:10.24432/C5J30W