2020-04-17 Data Scientist Assignment

Overview

There are multiple factors affecting the selling price of HDB resale flats in Singapore. As a data scientist in ABC, you are tasked to construct a hedonic regression model to estimate how the selling price of a HDB resale flat changes based on its distance to the CBD and the nearest MRT station, as well as its flat size, floor level and the years of lease remaining on the flat.

Hedonic regression is used to estimate the demand for a good or its value to consumers. It breaks down the item being researched into its constituent characteristics, and obtains estimates of the contributory value for each characteristic. Hedonic models are most commonly estimated using regression analysis.

Here, we used a Multiple Linear Regression Model with multiple covariates to predict the selling price of a HDB resale flat. This model is used to explain the relationship between one continuous dependent variable (outcome) and two or more independent variables (predictors). Our desired outcome, selling price of a HDB resale flat is a continuous variable, while our predictors can be continuous or categorical.

For this Question, we used data from data.gov.sg to obtain HDB resale flat prices. This data is based on registration dates from January 2017 onwards.

Loading Libraries

# Loading Libraries
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.6.3

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(corrplot)

## corrplot 0.84 loaded

library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

Loading and Cleaning Data

# Loading Data
RFPData <- read.csv('./RFP-2017.csv', header = TRUE)

# Cleaning Data
RFPData <- RFPData[, colSums(is.na(RFPData)) == 0]
summary(RFPData)

##  Distance_to_the_CBD Floor_Area_sqm    Floor_Level     Remaining_Lease
##  Min.   :    0       Min.   : 31.00   Min.   : 3.000   Min.   :45.00  
##  1st Qu.: 9000       1st Qu.: 82.00   1st Qu.: 6.000   1st Qu.:65.00  
##  Median :13300       Median : 95.00   Median : 9.000   Median :74.00  
##  Mean   :12014       Mean   : 97.77   Mean   : 9.609   Mean   :74.14  
##  3rd Qu.:16100       3rd Qu.:113.00   3rd Qu.:12.000   3rd Qu.:83.00  
##  Max.   :18500       Max.   :249.00   Max.   :51.000   Max.   :96.00  
##   Resale_Price    
##  Min.   : 140000  
##  1st Qu.: 330000  
##  Median : 408000  
##  Mean   : 438864  
##  3rd Qu.: 510000  
##  Max.   :1205000

Preparing Datasets for Prediction

We split the data (RFPData) into 50% for training and 50% for testing our Multiple Linear Regression Model.

# Splitting Data
set.seed(1234)
inTrain <- createDataPartition(RFPData$Resale_Price, p = 0.5, list = FALSE)
trainData <- RFPData[inTrain, ]
testData <- RFPData[-inTrain, ]
dim(trainData)

## [1] 34082     5

dim(testData)

## [1] 34079     5

trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))

Plotting a Correlation Matrix for Training Data

# Plotting a Correlation Matrix for Training Data
trainMatrix <- rcorr(as.matrix(trainData))
print(trainMatrix)

##                     Distance_to_the_CBD Floor_Area_sqm Floor_Level
## Distance_to_the_CBD                1.00           0.23       -0.15
## Floor_Area_sqm                     0.23           1.00        0.02
## Floor_Level                       -0.15           0.02        1.00
## Remaining_Lease                    0.30           0.21        0.27
## Resale_Price                      -0.30           0.64        0.37
##                     Remaining_Lease Resale_Price
## Distance_to_the_CBD            0.30        -0.30
## Floor_Area_sqm                 0.21         0.64
## Floor_Level                    0.27         0.37
## Remaining_Lease                1.00         0.32
## Resale_Price                   0.32         1.00
## 
## n= 34082 
## 
## 
## P
##                     Distance_to_the_CBD Floor_Area_sqm Floor_Level
## Distance_to_the_CBD                     0e+00          0e+00      
## Floor_Area_sqm      0e+00                              2e-04      
## Floor_Level         0e+00               2e-04                     
## Remaining_Lease     0e+00               0e+00          0e+00      
## Resale_Price        0e+00               0e+00          0e+00      
##                     Remaining_Lease Resale_Price
## Distance_to_the_CBD 0e+00           0e+00       
## Floor_Area_sqm      0e+00           0e+00       
## Floor_Level         0e+00           0e+00       
## Remaining_Lease                     0e+00       
## Resale_Price        0e+00

flattenCorrMatrix <- function(cormat, pmat){
  ut <- upper.tri(cormat)
  data.frame(row = rownames(cormat)[row(cormat)[ut]],
             column = rownames(cormat)[col(cormat)[ut]],
             cor = (cormat)[ut],
             p = (pmat)[ut])}
flattentrainMatrix <- flattenCorrMatrix(trainMatrix$r, trainMatrix$P)
print(flattentrainMatrix)

##                    row          column         cor            p
## 1  Distance_to_the_CBD  Floor_Area_sqm  0.23295566 0.0000000000
## 2  Distance_to_the_CBD     Floor_Level -0.15157569 0.0000000000
## 3       Floor_Area_sqm     Floor_Level  0.01998672 0.0002242157
## 4  Distance_to_the_CBD Remaining_Lease  0.30334392 0.0000000000
## 5       Floor_Area_sqm Remaining_Lease  0.20775208 0.0000000000
## 6          Floor_Level Remaining_Lease  0.27320180 0.0000000000
## 7  Distance_to_the_CBD    Resale_Price -0.29944217 0.0000000000
## 8       Floor_Area_sqm    Resale_Price  0.63533640 0.0000000000
## 9          Floor_Level    Resale_Price  0.37003704 0.0000000000
## 10     Remaining_Lease    Resale_Price  0.32201609 0.0000000000

From the Summary shown above, there are several associations between HDB Resale Flat Price and the other variables.

There is a negative correlation (-0.30) between HDB Resale Flat Price and the Flat’s Distance to the CBD. This suggests that HDB Resale Flat Price increases as the Flat’s Distance to the CBD decreases.

There is a positive correlation (0.64) between HDB Resale Flat Price and the Flat’s Floor Area (sqm). This suggests that HDB Resale Flat Price increases as the Flat’s Floor Area (sqm) increases.

There is a positive correlation (0.37) between HDB Resale Flat Price and the Flat’s Floor Level. This suggests that HDB Resale Flat Price increases as the Flat’s Floor Level increases.

There is a positive correlation (0.32) between HDB Resale Flat Price and the Flat’s Remaining Lease. This suggests that HDB Resale Flat Price increases as the Flat’s Remaining Lease increases.

Plotting a Correlation Plot for Training Data

# Plotting a Correlation Plot for Training Data
corrplot(trainMatrix$r, type = "upper", order = "FPC", method = "color",
         p.mat = trainMatrix$P, sig.level = 0.01, insig = "pch",
         tl.cex = 0.8, tl.col = "black", tl.srt = 45)

In the Correlation Plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We used a level of significance of 0.01 to determine correlations that are statistically significant. Correlations with a p-value > 0.01 are considered statistically insignificant and will be marked with a cross. As there are no boxes marked with a cross, all the correlations shown above are statistically significant.

Even though our variables are not highly correlated with one another, their correlations are statistically significant. Therefore, we will be using all our variables to build our Multiple Linear Regression Model.

Building our Multiple Linear Regression Model

# Building our Multiple Linear Regression Model
modFit <- train(Resale_Price ~ Distance_to_the_CBD + Floor_Area_sqm +
                  Floor_Level + Remaining_Lease, method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)

## Linear Regression 
## 
## 34082 samples
##     4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 34082, 34082, 34082, 34082, 34082, 34082, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   75490.39  0.7594656  58502.51
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

print(finMod)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##         (Intercept)  Distance_to_the_CBD       Floor_Area_sqm  
##            -88071.4                -17.4               4390.4  
##         Floor_Level      Remaining_Lease  
##              5442.9               3435.1

Plotting Regression Diagnostic Plots

# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)

The Regression Diagnostic Plots show residuals in 4 different ways:

Residuals vs Fitted - Used to examine the linearity assumptions. A horizontal line, without distinct patterns, is a good indication of linearity.
Normal Q-Q - Used to examine whether the residuals are normally distributed. The data is good if residual points follow the straight dashed line.
Scale-Location - Used to examine the homogeneity in variances of residuals (Homoscedasticity). A horizontal line with equally spread points is a good indication of Homoscedasticity.
Residuals vs Leverage - Used to identify extreme values that may influence the analysis.

Linearity of Data

The linearity assumptions can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are predictions derived from our model and training data, while residuals are the difference between the observed and estimated values.

# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)

From our Residuals vs Fitted Plot shown above, it suggests that our data may not have good linearity. Characteristics to support this include:

The residuals are not well distributed around the 0 straight dashed line, which usually suggests good linearity.
The residuals do not form a horizontal band around the 0 straight dashed line, which usually suggests similar variances.
Some residuals stand out from the distribution pattern, suggesting potential outliers.

Normality of Residuals

The normality assumptions can be examined by inspecting the Normal Q-Q Plot.

# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)

From our Normal Q-Q Plot shown above, it suggests that our data has good normality of residuals. Characteristics to support this include:

The residual points fall approximately along the straight dashed line.

However, there are some outliers in the Theoretical Quantiles range of 3 to 4. These outliers will need to be looked at individually to examine for anything unique or whether they are data entry errors.

Homoscedasticity

Homogeneity in variances of residuals can be examined by inspecting the Scale-Location Plot.

# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)

From our Scale-Location Plot shown above, it suggests that our data has good homoscedasticity. Characteristics to support this include:

The residuals are equally spread along the range of predictors.
The residuals are equally spread around the horizontal red line.

Outliers, High Leverage Points and Influential Values

Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is associated with a large residual and its inclusion or exclusion can alter the regression analysis.

Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.

# Plotting the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)

From our Residuals vs Leverage Plot shown above, our data does not present with any influential points. Characteristics to support this include:

All data points are well inside the Cook’s Distance Lines, which are represented by the red dashed line in our Residuals vs Leverage Plot.

However, there are some outliers as shown in the plots above. These outliers will need to be looked at individually to examine for anything unique or whether they are data entry errors.

Prediction with our Multiple Linear Regression Model

# Prediction with our Multiple Linear Regression Model
Prediction <- predict(modFit, testData)
qplot(Resale_Price, Prediction, colour = Distance_to_the_CBD, data = testData)

qplot(Resale_Price, Prediction, colour = Floor_Area_sqm, data = testData)

qplot(Resale_Price, Prediction, colour = Floor_Level, data = testData)

qplot(Resale_Price, Prediction, colour = Remaining_Lease, data = testData)

Summary of our Prediction

# Summary of our Prediction
summary(Prediction)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21443  342762  434242  439280  527732  969974