There are multiple factors affecting the selling price of HDB resale flats in Singapore. As a data scientist in ABC, you are tasked to construct a hedonic regression model to estimate how the selling price of a HDB resale flat changes based on its distance to the CBD and the nearest MRT station, as well as its flat size, floor level and the years of lease remaining on the flat.
Hedonic regression is used to estimate the demand for a good or its value to consumers. It breaks down the item being researched into its constituent characteristics, and obtains estimates of the contributory value for each characteristic. Hedonic models are most commonly estimated using regression analysis.
Here, we used a Multiple Linear Regression Model with multiple covariates to predict the selling price of a HDB resale flat. This model is used to explain the relationship between one continuous dependent variable (outcome) and two or more independent variables (predictors). Our desired outcome, selling price of a HDB resale flat is a continuous variable, while our predictors can be continuous or categorical.
For this Question, we used data from data.gov.sg to obtain HDB resale flat prices. This data is based on registration dates from January 2017 onwards.
# Loading Libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## corrplot 0.84 loaded
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Loading Data
RFPData <- read.csv('./RFP-2017.csv', header = TRUE)
# Cleaning Data
RFPData <- RFPData[, colSums(is.na(RFPData)) == 0]
summary(RFPData)
## Distance_to_the_CBD Floor_Area_sqm Floor_Level Remaining_Lease
## Min. : 0 Min. : 31.00 Min. : 3.000 Min. :45.00
## 1st Qu.: 9000 1st Qu.: 82.00 1st Qu.: 6.000 1st Qu.:65.00
## Median :13300 Median : 95.00 Median : 9.000 Median :74.00
## Mean :12014 Mean : 97.77 Mean : 9.609 Mean :74.14
## 3rd Qu.:16100 3rd Qu.:113.00 3rd Qu.:12.000 3rd Qu.:83.00
## Max. :18500 Max. :249.00 Max. :51.000 Max. :96.00
## Resale_Price
## Min. : 140000
## 1st Qu.: 330000
## Median : 408000
## Mean : 438864
## 3rd Qu.: 510000
## Max. :1205000
We split the data (RFPData) into 50% for training and 50% for testing our Multiple Linear Regression Model.
# Splitting Data
set.seed(1234)
inTrain <- createDataPartition(RFPData$Resale_Price, p = 0.5, list = FALSE)
trainData <- RFPData[inTrain, ]
testData <- RFPData[-inTrain, ]
dim(trainData)
## [1] 34082 5
dim(testData)
## [1] 34079 5
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
# Plotting a Correlation Matrix for Training Data
trainMatrix <- rcorr(as.matrix(trainData))
print(trainMatrix)
## Distance_to_the_CBD Floor_Area_sqm Floor_Level
## Distance_to_the_CBD 1.00 0.23 -0.15
## Floor_Area_sqm 0.23 1.00 0.02
## Floor_Level -0.15 0.02 1.00
## Remaining_Lease 0.30 0.21 0.27
## Resale_Price -0.30 0.64 0.37
## Remaining_Lease Resale_Price
## Distance_to_the_CBD 0.30 -0.30
## Floor_Area_sqm 0.21 0.64
## Floor_Level 0.27 0.37
## Remaining_Lease 1.00 0.32
## Resale_Price 0.32 1.00
##
## n= 34082
##
##
## P
## Distance_to_the_CBD Floor_Area_sqm Floor_Level
## Distance_to_the_CBD 0e+00 0e+00
## Floor_Area_sqm 0e+00 2e-04
## Floor_Level 0e+00 2e-04
## Remaining_Lease 0e+00 0e+00 0e+00
## Resale_Price 0e+00 0e+00 0e+00
## Remaining_Lease Resale_Price
## Distance_to_the_CBD 0e+00 0e+00
## Floor_Area_sqm 0e+00 0e+00
## Floor_Level 0e+00 0e+00
## Remaining_Lease 0e+00
## Resale_Price 0e+00
flattenCorrMatrix <- function(cormat, pmat){
ut <- upper.tri(cormat)
data.frame(row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor = (cormat)[ut],
p = (pmat)[ut])}
flattentrainMatrix <- flattenCorrMatrix(trainMatrix$r, trainMatrix$P)
print(flattentrainMatrix)
## row column cor p
## 1 Distance_to_the_CBD Floor_Area_sqm 0.23295566 0.0000000000
## 2 Distance_to_the_CBD Floor_Level -0.15157569 0.0000000000
## 3 Floor_Area_sqm Floor_Level 0.01998672 0.0002242157
## 4 Distance_to_the_CBD Remaining_Lease 0.30334392 0.0000000000
## 5 Floor_Area_sqm Remaining_Lease 0.20775208 0.0000000000
## 6 Floor_Level Remaining_Lease 0.27320180 0.0000000000
## 7 Distance_to_the_CBD Resale_Price -0.29944217 0.0000000000
## 8 Floor_Area_sqm Resale_Price 0.63533640 0.0000000000
## 9 Floor_Level Resale_Price 0.37003704 0.0000000000
## 10 Remaining_Lease Resale_Price 0.32201609 0.0000000000
From the Summary shown above, there are several associations between HDB Resale Flat Price and the other variables.
There is a negative correlation (-0.30) between HDB Resale Flat Price and the Flat’s Distance to the CBD. This suggests that HDB Resale Flat Price increases as the Flat’s Distance to the CBD decreases.
There is a positive correlation (0.64) between HDB Resale Flat Price and the Flat’s Floor Area (sqm). This suggests that HDB Resale Flat Price increases as the Flat’s Floor Area (sqm) increases.
There is a positive correlation (0.37) between HDB Resale Flat Price and the Flat’s Floor Level. This suggests that HDB Resale Flat Price increases as the Flat’s Floor Level increases.
There is a positive correlation (0.32) between HDB Resale Flat Price and the Flat’s Remaining Lease. This suggests that HDB Resale Flat Price increases as the Flat’s Remaining Lease increases.
# Plotting a Correlation Plot for Training Data
corrplot(trainMatrix$r, type = "upper", order = "FPC", method = "color",
p.mat = trainMatrix$P, sig.level = 0.01, insig = "pch",
tl.cex = 0.8, tl.col = "black", tl.srt = 45)
In the Correlation Plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We used a level of significance of 0.01 to determine correlations that are statistically significant. Correlations with a p-value > 0.01 are considered statistically insignificant and will be marked with a cross. As there are no boxes marked with a cross, all the correlations shown above are statistically significant.
Even though our variables are not highly correlated with one another, their correlations are statistically significant. Therefore, we will be using all our variables to build our Multiple Linear Regression Model.
# Building our Multiple Linear Regression Model
modFit <- train(Resale_Price ~ Distance_to_the_CBD + Floor_Area_sqm +
Floor_Level + Remaining_Lease, method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression
##
## 34082 samples
## 4 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 34082, 34082, 34082, 34082, 34082, 34082, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 75490.39 0.7594656 58502.51
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(finMod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Distance_to_the_CBD Floor_Area_sqm
## -88071.4 -17.4 4390.4
## Floor_Level Remaining_Lease
## 5442.9 3435.1
# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)
The Regression Diagnostic Plots show residuals in 4 different ways:
The linearity assumptions can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are predictions derived from our model and training data, while residuals are the difference between the observed and estimated values.
# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)
From our Residuals vs Fitted Plot shown above, it suggests that our data may not have good linearity. Characteristics to support this include:
The normality assumptions can be examined by inspecting the Normal Q-Q Plot.
# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)
From our Normal Q-Q Plot shown above, it suggests that our data has good normality of residuals. Characteristics to support this include:
However, there are some outliers in the Theoretical Quantiles range of 3 to 4. These outliers will need to be looked at individually to examine for anything unique or whether they are data entry errors.
Homogeneity in variances of residuals can be examined by inspecting the Scale-Location Plot.
# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)
From our Scale-Location Plot shown above, it suggests that our data has good homoscedasticity. Characteristics to support this include:
Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is associated with a large residual and its inclusion or exclusion can alter the regression analysis.
Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.
# Plotting the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)
From our Residuals vs Leverage Plot shown above, our data does not present with any influential points. Characteristics to support this include:
However, there are some outliers as shown in the plots above. These outliers will need to be looked at individually to examine for anything unique or whether they are data entry errors.
# Prediction with our Multiple Linear Regression Model
Prediction <- predict(modFit, testData)
qplot(Resale_Price, Prediction, colour = Distance_to_the_CBD, data = testData)
qplot(Resale_Price, Prediction, colour = Floor_Area_sqm, data = testData)
qplot(Resale_Price, Prediction, colour = Floor_Level, data = testData)
qplot(Resale_Price, Prediction, colour = Remaining_Lease, data = testData)
# Summary of our Prediction
summary(Prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21443 342762 434242 439280 527732 969974