INTRODUCTION

In this article, I will predict the value of real estate using a regression model. This model will determine what factors affect the fluctuations in the value of a real estate. The market historical data set of real estate valuation are collected from Sindian Dist., New Taipei City, Taiwan.

DATA PREPROCESSING

Install Packages

library(car)
library(dplyr)
library(GGally)
library(ggplot2)
library(lmtest)
library(manipulate)
library(MASS)
library(MLmetrics)
library(performance)

Import Data

realestate <- readxl::read_xlsx("Real estate valuation data set.xlsx")
realestate <- realestate %>%
  dplyr::select(-No)

Column Name Change

names(realestate) <- c("date", "age", "MRT_station", "store", "lat", "long", "price")

Adjusting Data Type

realestate$store <- as.integer(realestate$store)

Missing Value Check

anyNA(realestate)
#> [1] FALSE

Glossary Data

head(realestate)
  • date : the transaction date (for example, 2013.2 = 2013 March, 2013.5 = 2013 June, etc.)
  • age : the house age (unit: year)
  • MRT_station : the distance to the nearest MRT station (unit: meter)
  • store : the number of convenience stores in the living circle on foot (integer)
  • lat : the geographic coordinate, latitude. (unit: degree)
  • long : the geographic coordinate, longitude. (unit: degree)
  • price : house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared) ## Data Split
RNGkind(sample.kind = "Rounding") 
set.seed(123)

# index sampling
index <- sample(x = nrow(realestate), size= nrow(realestate)*0.75)

# splitting
train <- realestate[index,]
test <- realestate[-index,]

Correlation Between Variables

ggcorr(train, label = T)

Based on the correlation output above: - Variable store, lat, and long have a strong positive correlation with price. - Variable date has a weak positive correlation with price. - Variable age has a weak negative correlation with price. - Variable MRT_station has a strong negative correlation with price.

Initial Assumptions: - Transaction date and age of real estate have less effect on real estate prices. - The further north and east the property is located, the price of the property will increase. - The more convenience stores around the property, the higher the property price. - The further away the property is from the nearest MRT station, the lower the property price.

DATA MODELLING

All Predictors Model

model_all <- lm(price ~ ., data = realestate)
summary(model_all)
#> 
#> Call:
#> lm(formula = price ~ ., data = realestate)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -35.667  -5.412  -0.967   4.217  75.190 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value        Pr(>|t|)    
#> (Intercept) -14441.982719   6775.386097  -2.132         0.03364 *  
#> date             5.149017      1.556876   3.307         0.00103 ** 
#> age             -0.269697      0.038530  -7.000 0.0000000000106 ***
#> MRT_station     -0.004488      0.000718  -6.250 0.0000000010373 ***
#> store            1.133325      0.188160   6.023 0.0000000038269 ***
#> lat            225.470143     44.565775   5.059 0.0000006382166 ***
#> long           -12.429061     48.581168  -0.256         0.79820    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.858 on 407 degrees of freedom
#> Multiple R-squared:  0.5824, Adjusted R-squared:  0.5762 
#> F-statistic:  94.6 on 6 and 407 DF,  p-value: < 0.00000000000000022

Backward Model

model_backward <- step(model_all, direction = "backward", trace = 0)
summary(model_backward)
#> 
#> Call:
#> lm(formula = price ~ date + age + MRT_station + store + lat, 
#>     data = realestate)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -35.625  -5.373  -1.020   4.243  75.343 
#> 
#> Coefficients:
#>                   Estimate     Std. Error t value             Pr(>|t|)    
#> (Intercept) -15964.8038627   3233.1114480  -4.938      0.0000011535027 ***
#> date             5.1375550      1.5544483   3.305              0.00103 ** 
#> age             -0.2693805      0.0384660  -7.003      0.0000000000104 ***
#> MRT_station     -0.0043533      0.0004899  -8.887 < 0.0000000000000002 ***
#> store            1.1361928      0.1876103   6.056      0.0000000031674 ***
#> lat            226.8794043     44.1733708   5.136      0.0000004353834 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.847 on 408 degrees of freedom
#> Multiple R-squared:  0.5823, Adjusted R-squared:  0.5772 
#> F-statistic: 113.8 on 5 and 408 DF,  p-value: < 0.00000000000000022

Predict Values

# model_all
predict_all <- predict(model_all, newdata = test)

# model_backward
predict_backward <- predict(model_backward, newdata = test)

Comparison of The Two Models

compare_performance(model_all, model_backward)
  • AIC value of model_all is higher than model_backward
  • Adjusted-R square value of model_all is lower than model_backward
  • RMSE value of model_all is lower than model_backward

model_all has better perfomance than model_backward based on AIC, Adjuster-R square, and RMSE

MODEL EVALUATION

Mean Absolute Percentage Error (MAPE)

# model_all
MAPE(y_pred = predict_all, y_true = test$price)*100
#> [1] 20.02665
# model_backward
MAPE(y_pred = predict_backward, y_true = test$price)*100
#> [1] 20.05283

Error value of model_all lower than model_backward

check_model(model_all)

## Linearity

linear_all <- data.frame(residual = model_all$residuals, fitted = model_all$fitted.values)

linear_all %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) + 
  theme(panel.grid = element_blank(), panel.background = element_blank())

There is a pattern in the data, with the residuals has become more negative as the fitted values increase before increased again. The pattern indicate that our model may not be linear enough.

Normality of Residuals

shapiro.test(model_all$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all$residuals
#> W = 0.87622, p-value < 0.00000000000000022
check_normality(model_all)
#> Warning: Non-normality of residuals detected (p < .001).

The null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.

Homoscedasticity

bptest(model_all)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all
#> BP = 8.4674, df = 6, p-value = 0.2058
check_heteroscedasticity(model_all)
#> Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.040).

One of method to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity. With p-value > 0.05, we can conclude that heterocesdasticity is not present in our model.

Multicollinearity

vif(model_all)
#>        date         age MRT_station       store         lat        long 
#>    1.014655    1.014287    4.322984    1.617021    1.610225    2.926305

There is no predictors that has VIF value exceeds 5 or 10. We can conclude there is no multicollinearity in our model.

CONCLUSION

  1. We can use model_all to predict property value.
  2. Since the assumption of linearity is not met, we need to remove the non-linear predictor variables from the model or replace the model with another type of regression, for example tree-based model.
  3. Because the assumption of normality is not met, the results of the significance test and the standard error value of the intercept and slope of each predictor produced are biased or do not reflect the true value. We need to perform data transformation on target variable or add sample data.
  4. From model_all, it can be concluded that the variables date, age, MRT_station, store, and lat have a significant effect (p <0.05) on the property value.

REFERENCES

eh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.