Introduction

This document covers one of several data science projects carried out for different use cases. The dataset comes from the public domain and is curated to showcase different techniques and their business value.

The primary objectives are to demonstrate the following:

  1. Explore, understand and cleanse the data with explained reasoning (Exploratory Data Analysis)

  2. Identify, implement, test and select the best possible statistical model based on evidence in the data

  3. Document the work systematically with reproducible R code so that others can reproduce, understand and improve it

  4. Summarise the findings with concise business value and insights in an executive summary

  5. Understand the full data project cycle for creating a commercial data product, with the ability to execute the statistics through an R application

What it does not cover

  1. A complete data product via dashboard: visualising the processed output data through visualisation tools such as Qlik or Tableau - trained in Data Visualisation ;)

  2. Deep business domain knowledge of the dataset

  3. External dataset(s) identified to enrich the data or add another dimension for tackling the business question

  4. Working style or collaboration methods with other data team members (ranging from data analyst, business analyst, data scientist, data engineer, visualisation engineer and business stakeholder to project manager)

  5. Other statistical models that might perform better or serve as alternatives.

Justification:

  1. Time and scope for executing each of these projects are limited; refinement is never ending.
  2. The essential goal is to meet the primary objectives above, not to showcase the breadth of techniques or statistical models.
  3. There is always a better approach or an alternative way to test the hypothesis.

All data sources used are taken from the public domain and the projects are implemented using RStudio. For the full markdown version, kindly contact me via teochunwey@gmail.com

Content

  1. Overview
  2. Data Dictionary
  3. Exploratory Data Analysis
  4. Analysis & Modeling
  5. Evaluation & Interpretation
  6. Findings & Executive Summary

Overview

In this project, the use case is to determine which predictor variable (X) best forecasts the resale price (response variable, Y).

The dataset in this context is Singapore HDB resale data from January 2015 to July 2018.

Reasoning:

Housing is one of the main topics of discussion in Singapore. High resale prices greatly affect people's quality of life and the upbringing of their children.

Another consideration is that the dataset variables, such as resale price and floor area, can be easily identified and reproduced in other countries for a similar use case.

As the modeling is based on historical data and certain assumptions, this project does not take into account external factors such as GDP, average household income, political parties, election periods, land sale prices, etc.

Data Dictionary

 #  Name                 Title                     Type                        Unit of Measure  Description
 1  month                Month                     Datetime (Month) "YYYY-MM"  -                -
 2  town                 Town                      Text (General)              -                -
 3  flat_type            Flat Type                 Text (General)              -                -
 4  block                Block                     Text (General)              -                -
 5  street_name          Street Name               Text (General)              -                -
 6  storey_range         Storey Range              Text (General)              -                -
 7  floor_area_sqm       Floor Area                Numeric (General)           Sqm              -
 8  flat_model           Flat Model                Text (General)              -                -
 9  lease_commence_date  Lease Commencement Date   Datetime (Year) "YYYY"      -                -
10  remaining_lease      Remaining Lease           Numeric (General)           Years            -
11  resale_price         Resale Price              Numeric (General)           $                -

Data Source: https://data.gov.sg/dataset/resale-flat-prices

Exploratory Data Analysis

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(stats)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(corrplot)
## corrplot 0.84 loaded
library(ggfortify)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:randomForest':
## 
##     outlier
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(caret)
## Loading required package: lattice
df.original<-read.csv('resales 2015.csv')
set.seed(1) #Ensure reproducible code 
df<- sample_frac(df.original,0.7) #split into test and train data by 7:3 ratio
df.index<- as.numeric(rownames(df))
df.test<- df.original[-df.index,]
head(df)
##         month            town flat_type block     street_name storey_range
## 18524 2016-01         PUNGGOL    3 ROOM  306A      PUNGGOL PL     16 TO 18
## 25962 2016-06   BUKIT PANJANG    5 ROOM   505     JELAPANG RD     04 TO 06
## 39966 2017-03        CLEMENTI    5 ROOM   337  CLEMENTI AVE 2     16 TO 18
## 63361 2018-04       WOODLANDS    5 ROOM   717 WOODLANDS DR 70     07 TO 09
## 14070 2015-10 KALLANG/WHAMPOA    3 ROOM    28     JLN BAHAGIA     01 TO 03
## 62674 2018-04       PASIR RIS    5 ROOM   226 PASIR RIS ST 21     10 TO 12
##       floor_area_sqm        flat_model lease_commence_date remaining_lease
## 18524             67 Premium Apartment                2012              95
## 25962            123          Improved                1998              80
## 39966            119          Improved                1978              60
## 63361            120          Improved                1997              78
## 14070             53          Standard                1971              54
## 62674            126          Improved                1993              74
##       resale_price
## 18524       407000
## 25962       446888
## 39966       718000
## 63361       395000
## 14070       268500
## 62674       425000

Investigate and compare the sample records against the data dictionary to:

  1. Ensure the data dictionary information is consistent and accurate.
  2. Understand each column's data values and the logic and meaning behind them.
str(df)
## 'data.frame':    48837 obs. of  11 variables:
##  $ month              : Factor w/ 43 levels "2015-01","2015-02",..: 13 18 27 40 10 40 42 30 29 4 ...
##  $ town               : Factor w/ 26 levels "ANG MO KIO","BEDOK",..: 18 6 10 25 15 17 11 18 13 11 ...
##  $ flat_type          : Factor w/ 7 levels "1 ROOM","2 ROOM",..: 3 5 5 5 3 5 4 4 3 3 ...
##  $ block              : Factor w/ 2207 levels "1","10","100",..: 734 1250 839 1768 617 443 771 561 491 482 ...
##  $ street_name        : Factor w/ 525 levels "ADMIRALTY DR",..: 341 207 128 485 210 321 453 339 234 160 ...
##  $ storey_range       : Factor w/ 17 levels "01 TO 03","04 TO 06",..: 6 2 6 3 1 4 1 5 3 3 ...
##  $ floor_area_sqm     : num  67 123 119 120 53 126 84 93 67 59 ...
##  $ flat_model         : Factor w/ 21 levels "2-room","Adjoined flat",..: 13 5 5 5 18 5 17 13 12 5 ...
##  $ lease_commence_date: int  2012 1998 1978 1997 1971 1993 1985 2013 1983 1977 ...
##  $ remaining_lease    : int  95 80 60 78 54 74 66 94 64 61 ...
##  $ resale_price       : num  407000 446888 718000 395000 268500 ...

Investigate the data types in the data frame (df) structure to ensure no changes are required.

summary(df)
##      month                town                  flat_type    
##  2018-07: 1775   JURONG WEST: 3778   1 ROOM          :   12  
##  2017-11: 1406   SENGKANG   : 3502   2 ROOM          :  504  
##  2018-06: 1393   WOODLANDS  : 3499   3 ROOM          :12517  
##  2017-05: 1386   TAMPINES   : 3299   4 ROOM          :19953  
##  2017-08: 1362   BEDOK      : 2972   5 ROOM          :11912  
##  2018-03: 1332   YISHUN     : 2852   EXECUTIVE       : 3931  
##  (Other):40183   (Other)    :28935   MULTI-GENERATION:    8  
##      block                   street_name      storey_range  
##  2      :  194   YISHUN RING RD    :  805   04 TO 06:11423  
##  1      :  165   BEDOK RESERVOIR RD:  623   07 TO 09:10679  
##  8      :  163   ANG MO KIO AVE 10 :  589   10 TO 12: 9347  
##  101    :  153   PUNGGOL CTRL      :  545   01 TO 03: 8897  
##  113    :  146   ANG MO KIO AVE 3  :  537   13 TO 15: 4385  
##  110    :  145   PUNGGOL FIELD     :  509   16 TO 18: 1867  
##  (Other):47871   (Other)           :45229   (Other) : 2239  
##  floor_area_sqm               flat_model    lease_commence_date
##  Min.   : 31.00   Model A          :14783   Min.   :1966       
##  1st Qu.: 76.00   Improved         :12458   1st Qu.:1984       
##  Median : 97.00   New Generation   : 7629   Median :1990       
##  Mean   : 97.66   Premium Apartment: 5118   Mean   :1992       
##  3rd Qu.:112.00   Simplified       : 2302   3rd Qu.:2001       
##  Max.   :259.00   Apartment        : 1978   Max.   :2016       
##                   (Other)          : 4569                      
##  remaining_lease  resale_price    
##  Min.   :24.00   Min.   : 170000  
##  1st Qu.:66.00   1st Qu.: 340000  
##  Median :73.00   Median : 410000  
##  Mean   :74.06   Mean   : 441000  
##  3rd Qu.:83.00   3rd Qu.: 505000  
##  Max.   :97.00   Max.   :1160000  
## 

Next, summarise each variable to understand the number of observations, the distribution of the measures and potential outliers.

summary(is.na(df))
##    month            town         flat_type         block        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:48837     FALSE:48837     FALSE:48837     FALSE:48837    
##  street_name     storey_range    floor_area_sqm  flat_model     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:48837     FALSE:48837     FALSE:48837     FALSE:48837    
##  lease_commence_date remaining_lease resale_price   
##  Mode :logical       Mode :logical   Mode :logical  
##  FALSE:48837         FALSE:48837     FALSE:48837

Investigate whether there are any missing values (NA) in each variable. Fortunately, this dataset is quite clean.

However, if there were NA values, the following approaches are recommended (a short sketch follows the list).

  1. Exclude them in calculations using na.rm=TRUE -> generally used on a vector

  2. Use default values pre-defined based on domain knowledge; the usual suspects are 0, -1 or 9999999999

  3. Recode the missing value with the mean/median value of that variable

  4. Omit the records with missing values using na.omit(df) -> subsets the dataset
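
A minimal sketch of these options, assuming a generic data frame df with missing values in resale_price (this particular dataset does not actually need any of them):

#Option 1: exclude NAs within a calculation on a vector
mean(df$resale_price, na.rm = TRUE)
#Option 2: substitute a pre-defined default value based on domain knowledge
df$resale_price[is.na(df$resale_price)] <- 0
#Option 3: recode NAs with the median of that variable
df$resale_price[is.na(df$resale_price)] <- median(df$resale_price, na.rm = TRUE)
#Option 4: drop any record containing a missing value
df.clean <- na.omit(df)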

Before jumping into visual analysis, it is sometimes useful to look at the frequency of important variables. In this context, town, flat_type, floor_area_sqm and storey_range are identified. Based on basic domain understanding and assumptions, these are factors often considered to affect the resale price. For illustration's sake, the results below show the frequency and distribution of each of these variables.

table(df$town)
## 
##      ANG MO KIO           BEDOK          BISHAN     BUKIT BATOK 
##            2370            2972             960            1934 
##     BUKIT MERAH   BUKIT PANJANG     BUKIT TIMAH    CENTRAL AREA 
##            1884            1704             137             504 
##   CHOA CHU KANG        CLEMENTI         GEYLANG         HOUGANG 
##            2192            1179            1306            2408 
##     JURONG EAST     JURONG WEST KALLANG/WHAMPOA   MARINE PARADE 
##            1092            3778            1520             299 
##       PASIR RIS         PUNGGOL      QUEENSTOWN       SEMBAWANG 
##            1530            2647            1298            1325 
##        SENGKANG       SERANGOON        TAMPINES       TOA PAYOH 
##            3502            1075            3299            1571 
##       WOODLANDS          YISHUN 
##            3499            2852
table(df$flat_type)
## 
##           1 ROOM           2 ROOM           3 ROOM           4 ROOM 
##               12              504            12517            19953 
##           5 ROOM        EXECUTIVE MULTI-GENERATION 
##            11912             3931                8
table(df$storey_range)
## 
## 01 TO 03 04 TO 06 07 TO 09 10 TO 12 13 TO 15 16 TO 18 19 TO 21 22 TO 24 
##     8897    11423    10679     9347     4385     1867      807      578 
## 25 TO 27 28 TO 30 31 TO 33 34 TO 36 37 TO 39 40 TO 42 43 TO 45 46 TO 48 
##      313      194       83      101       88       48        8       13 
## 49 TO 51 
##        6

The findings above give some direction on the distributions and potential outliers (including influential outliers).

Next, visual analysis - number of transactions by month.

ggplot(df,aes(x=month))+geom_bar(color='darkblue', fill='lightblue') + theme(axis.text.x= element_text(angle=90,hjust=1)) + ggtitle("No. of transactions by month")

The plot above shows the following:

  1. Over the past three-plus years, there seems to be a seasonal trend in resale housing.

  2. Generally, there is a dip during the Chinese New Year (CNY) period and a spike around October.

  3. On a year-on-year basis, annual transactions are trending higher.

  4. In July 2018 there is a large spike of around 2.5k transactions, most likely due to external factors or news such as a new policy. This requires further investigation.

df$year<- as.factor(substr(df$month,1,4))
ggplot(df,aes(x=year))+geom_bar(color='darkblue', fill='lightblue') + theme(axis.text.x= element_text(hjust=1)) + ggtitle("No. of transactions by year")+ geom_text(stat='count',aes(label=..count..),vjust=2)

The plot above gives a holistic view of resale transactions by year.

Findings:

  1. This validates the assumption made in point #3 of the previous plot.

  2. The 2018 data only runs up to July. Based on past data, total transactions by year end may trend higher, barring any unforeseen circumstances.

The temporal plots give us a good understanding of what is happening at a macro level. Next is the visual analysis via histogram.

ggplot(df,aes(resale_price))+geom_histogram(color='darkblue', fill='lightblue') + ggtitle("Resale pricing histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows a right-skewed (positively skewed) distribution with a long right tail.

Findings:

  1. The mean is to the right of the peak.

  2. R's default number of bins is 30. We will redefine the number of bins and replot the histogram.

Based on the full dataset (df.original) rather than the 70% sample summarised above, the minimum and maximum resale prices are 170000 and 1180000 respectively. The following is computed to get the bin width (a small R sketch follows the list):

  1. Max - Min: 1180000 - 170000 = 1010000
  2. Total data points: 69767
  3. No. of bins: sqrt(69767) = 264.1344, rounded up to 265
  4. Bin width: 1010000 / 265 = 3811.321
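
As a small sketch, the same square-root-rule arithmetic can be computed directly in R (df.original is the full dataset loaded earlier):

#Square-root rule for the number of bins, computed on the full dataset
n.points <- nrow(df.original)
n.bins <- ceiling(sqrt(n.points))
bin.width <- (max(df.original$resale_price) - min(df.original$resale_price)) / n.bins
bin.width #roughly 3811, given a price range of about 1010000 spread over 265 bins
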
ggplot(df,aes(resale_price)) +
  geom_histogram(color='grey', fill='peachpuff', binwidth = 3811.321) +
  ggtitle("Resale pricing histogram") +
  coord_cartesian(xlim = c(230000, 900000)) +
  geom_vline(aes(xintercept=mean(resale_price)), color="blue", linetype="dashed", size=1, alpha = 1) +
  geom_vline(aes(xintercept=median(resale_price)), color="chocolate3", linetype="dashed", size=1, alpha = 1) +
  geom_text(x = median(df$resale_price), y = 200, label = paste("Median", round(median(df$resale_price), digits = 2), sep = '\n'), color = "chocolate3") +
  geom_text(x = mean(df$resale_price), y = 200, label = paste("Mean", round(mean(df$resale_price), digits = 2), sep = '\n'), color = "blue", hjust = -1)

Findings:

  1. The right-skewed distribution has a mean resale price of about $440000 and a median of about $410000.

  2. The mode (peak) of the resale price hovers around the $420000 mark, with a count of nearly 1500 transactions.

  3. The distribution is concentrated mainly on the left of the plot.

  4. The right tail is longer.

  5. This gives a picture of the price range that transaction prices generally fall under.

Note: A standard linear regression is not strictly valid here because the normality assumption on the response variable is not met. A log or square-root transformation can be applied to bring the distribution closer to normal.

Steps which could be taken further (not done in this analysis; a sketch of step 1 follows the list):

  1. Group the distribution by years
  2. Group the distribution by town
  3. Group the distribution by flat type
  4. Group the distribution by flat model
  5. Group the distribution by storey range
  6. 25th percentile & 75th percentile lines can be added
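
A minimal sketch of step 1 (not run here), faceting the histogram by the year column created earlier; the bin count is illustrative only:

#Histogram of resale price grouped (faceted) by year
ggplot(df, aes(resale_price)) + geom_histogram(color = 'darkblue', fill = 'lightblue', bins = 50) + facet_wrap(~year) + ggtitle("Resale pricing histogram by year")
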
log.x <- log(df$resale_price)
ggplot(df,aes(log.x))+geom_histogram(color='grey',fill='peachpuff') + ggtitle("Resale pricing histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The log transformation is applied to assess the normality of the distribution.
Findings:

  1. The distribution is bimodal, with two peaks.

  2. The right skewness is still present.

  3. For illustration's sake, we will continue with the linear regression analysis even though the distribution does not satisfy the assumptions of a parametric test.

Next step, visual analysis - boxplots of the resale price distribution by town, storey range, flat type, etc.

ggplot(data=df, aes(x=town, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by town (Boxplot 1)') + theme(axis.text.x= element_text(angle=90,hjust=1)) +
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(2,mean(resale_price),label = 'mean', vjust = -1),color='blue')

ggplot(data=df, aes(x=storey_range, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by storey range (Boxplot 2)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(2,mean(resale_price),label = 'mean', vjust = -1),color='blue')

ggplot(data=df, aes(x=flat_type, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by flat type (Boxplot 3)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(1,mean(resale_price),label = 'mean', vjust = -1),color='blue')

ggplot(data=df, aes(x=lease_commence_date, y=resale_price,group=lease_commence_date)) + geom_boxplot() +ggtitle('Distribution of resale price by lease commence date (Boxplot 4)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(1966,mean(resale_price),label = 'mean', vjust = -1),color='blue') 

ggplot(data=df, aes(x=remaining_lease, y=resale_price,group=remaining_lease)) + geom_boxplot() +ggtitle('Distribution of resale price by remaining lease (Boxplot 5)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(24,mean(resale_price),label = 'mean', vjust = -1),color='blue') +  coord_cartesian(
xlim = c(23, 100))

Based on the boxplots above (a non-exhaustive EDA), the findings are:

  1. For resale pricing by town, the higher medians (50th percentile) fall in Bishan, Bukit Timah, Central Area, Bukit Merah, Kallang/Whampoa, Queenstown and Serangoon. This coincides with domain knowledge, as these are the mature estates.

  2. For resale pricing by storey range, generally the higher the storey range, the higher the median (50th percentile) resale price.

  3. Most of the outliers in resale pricing fall between storeys 01 to 18. Based on domain knowledge, this may be because flats constructed in the early days (1966 to 1990) topped out at around 12 to 18 storeys. These naturally form the current mature estates, where the median is higher than the mean (shown in boxplot 1). This explains why mature estates sell at a higher price.

  4. For resale pricing by flat type, the 3-room, 4-room, 5-room and executive types have the most outliers.

  5. The large number of outliers needs to be addressed, as they greatly affect the linear model estimation. This can be checked using a Cook's distance plot (a quick sketch follows).
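
As a quick sketch of that check (a simple model is fitted here purely for illustration; the formal modeling follows in a later section):

#Fit a simple model only to inspect Cook's distance for influential observations
fit.check <- lm(resale_price ~ floor_area_sqm, data = df)
plot(fit.check, which = 4) #Cook's distance plot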

corrplot

cor.result<- cor(df[,c(7,10,11)])
corrplot(cor.result, method="ellipse")

Based on the colour intensity of the ellipses, resale price and floor area (sqm) have the highest correlation.

Removing Outliers

#Create quantile on dataset group by floor area sqm
floorarea.Quantile<-ddply(df,'floor_area_sqm',function(x) quantile(x$resale_price))
colnames(floorarea.Quantile)<-c("floor_area_sqm","minValue", "Q1", "Q2", "Q3","maxValue")

#create IQR & Range
floorarea.Quantile$IQR <- floorarea.Quantile$Q3-floorarea.Quantile$Q1
floorarea.Quantile$Range <- floorarea.Quantile$IQR*1.5
floorarea.Quantile$lowOutlier<- (floorarea.Quantile$Q1-floorarea.Quantile$Range)
floorarea.Quantile$upperOutlier<- (floorarea.Quantile$Q3 + floorarea.Quantile$Range)
#create variables
OBS<- data.table(df)
SAM<- data.table(floorarea.Quantile)
#Merge the quantile and dataset together
df101<-merge(OBS,SAM, allow.cartesian = TRUE, by='floor_area_sqm')
#Remove Upper Outlier & Lower Outlier
dfsubset3<-df101[!df101$resale_price > df101$upperOutlier,]
dfsubset3<-dfsubset3[!dfsubset3$resale_price<dfsubset3$lowOutlier,]
#Boxplot to have a visual inspection
ggplot(data=dfsubset3, aes(x=floor_area_sqm, y=resale_price,group=floor_area_sqm)) + geom_boxplot() +ggtitle('Distribution of resale price by floor area sqm (Boxplot 7)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(dfsubset3$resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(24,mean(dfsubset3$resale_price),label = 'mean', hjust =-1),color='blue') +  coord_cartesian(
xlim = c(31, 200))+stat_boxplot(geom ='errorbar')

corrplot

cor.result<- cor(dfsubset3[, .(floor_area_sqm, remaining_lease, resale_price)]) #recompute the correlations on the outlier-free subset
corrplot(cor.result, method="ellipse")

Rerun the correlation plot without the outliers. The previous correlation conclusion still holds.

Steps which could be taken further (not done in this analysis):

  1. Scatterplot of two measures, such as resale_price ~ lease_commence_date (a sketch follows this list)
  2. The above plots can be further grouped by categorical variables.
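
A minimal sketch of the first step (not run here):

#Scatterplot of resale price against lease commencement date
ggplot(df, aes(x = lease_commence_date, y = resale_price)) + geom_point(alpha = 0.2) + ggtitle("Resale price vs lease commencement date")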

Analysis & Modeling

The EDA above gives a better understanding of the data: the distributions, the outliers in each variable and the shape of the resale price. With this information, the following steps will be performed in this section:

  1. Selection of Modeling
  2. Evaluation

There are 3 approaches to identifying the independent variables (X variables):

  1. Theory - use other research findings to determine the important independent variables and specify the best model.

  2. Automatic tools - principal component regression, stepwise regression and best subsets regression.

  3. Domain knowledge - deep domain knowledge gained through observation and research equips one to identify the important independent variables.

For this project, we are leveraging point #3, with the assumption that floor area (sqm) is an important factor in resale pricing.
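
For contrast, a minimal sketch of approach #2 (automatic tools) using stepwise selection on the numeric variables is shown below; this is an illustration only and is not used in the rest of the project:

#Stepwise (AIC-based) selection starting from a model with the numeric predictors
full.model <- lm(resale_price ~ floor_area_sqm + remaining_lease + lease_commence_date, data = df)
step.model <- step(full.model, direction = "both", trace = FALSE)
summary(step.model)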

A few things to note before we move on:

  1. The best model is only as good as the variables identified and measured during the study.

  2. The data sample might reflect chance or the data collection methodology, which can result in false positives and false negatives.

  3. Multicollinearity can impact significance and make it difficult to assess the significance of each predictor/independent variable.

  4. Correlation does not imply causation.

Selection of Modeling

The following models will be used:

  1. Linear Regression (Continuous Y, Continuous X)
  2. ANOVA (Continuous Y, Categorical X) - Next project

Regression

regressionFloor <- lm(formula = resale_price ~floor_area_sqm, data=dfsubset3)
summary(regressionFloor)
## 
## Call:
## lm(formula = resale_price ~ floor_area_sqm, data = dfsubset3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218968  -53292  -15528   29605  653444 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    59528.45    1683.26   35.37   <2e-16 ***
## floor_area_sqm  3735.28      16.72  223.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87380 on 45734 degrees of freedom
## Multiple R-squared:  0.5217, Adjusted R-squared:  0.5217 
## F-statistic: 4.989e+04 on 1 and 45734 DF,  p-value: < 2.2e-16

Based on the linear regression model (resale_price ~ floor_area_sqm), the following are assessed:

  1. p value - to assess statistical significance

  2. t value - to determine whether the mean of a population differs significantly from a specific value (the hypothesised mean) or from the mean of another population

  3. Coefficient estimates

  4. Diagnostic plots - to assess whether the model fits the data

P value

  1. The p value is less than 0.05, hence we can safely reject the null hypothesis that floor area does not affect the resale housing price.

t value

  1. The t value is greater than 2 (with more than 30 observations), hence the sample mean differs in a statistically significant way from the hypothesised mean.

This means the difference measured by the t value is statistically significant (not by chance), with a high ratio of signal to noise.

In short, the higher the t value the better, as it is formulated as the ratio of signal (the difference between the sample mean and the hypothesised population mean) to noise (the variation).

If the t value does not achieve statistical significance, it could be due to the following reasons:

  1. The difference (signal) is not large enough

  2. The variation (noise) is too big: possibly due to special-cause variation, which can be identified using a control chart

  3. The sample size is too small: conversely, an extremely large sample can produce statistically significant results even when a difference is very small and has no practical consequence

The results are as follows:

p value: less than 0.05 (statistically significant)
t value: more than 2 (statistically significant difference)
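
Both values can also be read programmatically from the model summary, for example:

#Coefficient matrix: estimates, standard errors, t values and p values
coef(summary(regressionFloor))
coef(summary(regressionFloor))["floor_area_sqm", "t value"]
coef(summary(regressionFloor))["floor_area_sqm", "Pr(>|t|)"]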

Coefficient Estimate

Third, visit the coefficient estimates for the associated change between the dependent variable (y) and the independent variable (x/predictor); a small worked example follows the list.

  1. The first row is the intercept: an expected resale price of $59528.45 for 0 sqm of floor area, which in our context does not make sense. You can't have no house to sell and still have a price tag of around $60k.

  2. The second row is the coefficient: for every 1 sqm increase in floor area, the resale price increases by $3735.28.
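
As a small worked example, plugging a hypothetical 100 sqm flat into the fitted equation gives roughly $433,000:

#Manual computation from the printed coefficients
59528.45 + 3735.28 * 100 #about 433056
#Equivalent computation using the fitted model object
predict(regressionFloor, newdata = data.frame(floor_area_sqm = 100))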

Diagnostic Plots

autoplot(regressionFloor)

The ideal result for the plots above should be unbiased (residuals randomly distributed) and homoscedastic (spread uniformly around the red line).

Residuals = observation - fitted value (the y value from the linear regression formula)

Unbiased: the residuals have an average value of zero in any thin vertical strip, and

homoscedastic ("same stretch"): the spread of the residuals is the same in any thin vertical strip.

Residuals vs Fitted

Top left: the plot is biased and heteroscedastic. As such, linear regression does not fit this dataset well.

Normal Q-Q Plot (Quantile-Quantile)

Top right: the distribution is right-skewed based on the tail. The deviation becomes larger as the resale price increases.

Scale-Location

Bottom left: this plots the square root of the absolute standardised residuals. Like the top-left plot, it shows bias and heteroscedasticity.

Residuals vs Leverage (Cook's distance)

Bottom right: there is no influential observation outside the Cook's distance threshold (the threshold lines are not visible in the plot).

For the sake of illustration, suppose the residual plots were unbiased and homoscedastic. The next values to assess would be R squared, predicted R squared and the F test.

R squared is a statistical measure of how close the data are to the fitted regression line. The downside is that in multiple regression, adding any independent variable increases R squared; thus, adjusted R squared is used instead.

Adjusted R squared: 0.5217
Predicted R squared: used to determine how well a regression model makes predictions.

Predictive power of regression

pr <- residuals(regressionFloor)/(1 - lm.influence(regressionFloor)$hat)
PRESS <- sum(pr^2)
PRESS #predictive power of regression (smaller better)
## [1] 3.492532e+14
# anova to calculate residual sum of squares
my.anova <- anova(regressionFloor)
tss <- sum(my.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS/(tss)
pred.r.squared
## [1] 0.5216753

Predicted R squared and the predicted residual sum of squares (PRESS) are both used to judge the predictive power of the regression. PRESS is preferred here.

RMSE<-sqrt(mean(regressionFloor$residuals^2))
RMSE
## [1] 87382.22
#predict.model <- predict(regressionFloor,df.test)
#new.modelvalues <- data.frame(obs=df.test$floor_area_sqm, pred='prediction')
#defaultSummary(new.modelvalues)

RMSE is the square root of the mean squared residual. It indicates the absolute fit of the model to the data: how close the observed data points are to the model's predicted values.

In this context, it is not considered a good fit (lower values of RMSE are better).
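
As a sketch (not run here), the same measure can be computed out-of-sample on the held-out df.test created at the start:

#Out-of-sample RMSE on the 30% hold-out set
test.pred <- predict(regressionFloor, newdata = df.test)
sqrt(mean((df.test$resale_price - test.pred)^2))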

Conclusion: based on the diagnostics above, the linear model does not have a good fit.

Kendall-Theil (Siegel) regression and quantile regression are suggested as alternative models.
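
A minimal sketch of the quantile regression alternative, assuming the quantreg package is installed (it is not loaded above):

library(quantreg)
#Median (tau = 0.5) regression of resale price on floor area
quantFloor <- rq(resale_price ~ floor_area_sqm, tau = 0.5, data = dfsubset3)
summary(quantFloor)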


Reference

http://rpubs.com/chunwey/RegressionReference