Introduction

Real estate investments are notoriously risky.
However, many investors are still willing to take the chance in hopes of a lucrative profit.
In this presentation, we analyse the various factors that determine the price of residential properties.
We aim to find the feature that proves to be the deciding factor in making a profitable investment.

Problem Statement

In order to find the deciding factor for a profitable investment in residential properties, we determined the relationship between the price of a house and other factors using linear regression.

Data

To guide our investigation, we took the “Real estate valuation” data set from the UCI repository. The data set consists of real estate valuations collected from Sindian District, New Taipei City, Taiwan, between 2012 and 2013.

The available features in the data set are

Transaction date (example: 2013.500= 2013 June)
House age (unit: year)
Distance to the nearest MRT station (unit: meter)
Number of convenience stores in the living circle on foot (integer)
Geographic coordinate, latitude. (unit: degree)
Geographic coordinate, longitude. (unit: degree)
House price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared)
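
To make the price unit concrete, the sketch below converts a price quoted in 10000 New Taiwan Dollar per Ping into New Taiwan Dollar per square meter, using the 1 Ping = 3.3 square meters figure given above. The helper name `price_per_sqm` is hypothetical and not part of the original analysis; the input 37.79 is the mean price reported in the summary statistics of this presentation.

```r
# Hypothetical helper: convert a price in 10000 NTD/Ping to NTD per square meter.
# Assumes 1 Ping = 3.3 square meters, as stated in the feature list.
price_per_sqm <- function(price_10k_ntd_per_ping, sqm_per_ping = 3.3) {
  price_10k_ntd_per_ping * 10000 / sqm_per_ping
}

# Mean house price from the summary statistics (37.79, in 10000 NTD/Ping)
mean_price_sqm <- price_per_sqm(37.79)
round(mean_price_sqm)
```

Under these assumptions the mean price corresponds to roughly 114,500 New Taiwan Dollar per square meter.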

During data preparation, we

Renamed the columns appropriately
Checked for duplicates and null values
Removed the ‘Transaction date’ column, since all dates fall between 2012 and 2013 and we do not compare prices across different periods.
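
The preparation steps above can be sketched in base R as follows. This is a minimal illustration rather than the original preparation script: the three rows are toy values, and the raw column names (`X1.transaction.date` and so on) are an assumption about how the file looks after import.

```r
# Toy stand-in for the imported data set; values and raw column names are assumptions.
raw <- data.frame(
  No = 1:3,
  X1.transaction.date = c(2012.917, 2012.917, 2013.583),
  X2.house.age = c(32.0, 19.5, 13.3),
  X3.distance.to.the.nearest.MRT.station = c(84.9, 306.6, 562.0),
  X4.number.of.convenience.stores = c(10L, 9L, 5L),
  X5.latitude = c(24.983, 24.980, 24.987),
  X6.longitude = c(121.540, 121.539, 121.544),
  Y.house.price.of.unit.area = c(37.9, 42.2, 47.3)
)

# 1. Rename the columns to the names used throughout this presentation
names(raw) <- c("No", "Transaction_date", "House_age",
                "Distance_to_the_nearest_MRT_station",
                "Number_of_convenience_stores",
                "Latitude", "Longitude", "House_price_of_unit_area")

# 2. Count duplicated rows (ignoring the running index No) and missing values
n_duplicates <- sum(duplicated(raw[, names(raw) != "No"]))
n_missing <- sum(is.na(raw))

# 3. Drop the transaction date column
RealEstate <- raw[, names(raw) != "Transaction_date"]
```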

Visualization and Descriptive Statistics

Price distribution

The plots below show the distribution of house prices and allow a check for outliers. According to the distribution, most houses were priced between 35 and 45.

RealEstate$House_price_of_unit_area %>% hist(main="Price distribution of houses", col= "blue", breaks = 20)
grid()

boxplot(RealEstate$House_price_of_unit_area,main = "Box plot of House prices", ylab="Prices")
grid()

House age distribution

The plots below show the age distribution of the houses and allow a check for outliers. The maximum house age in the data set is 43.8 years, and most houses are between 10 and 20 years old.

RealEstate$House_age %>% hist(main="Age distribution of houses", col= "yellow", breaks = 20)
grid()

boxplot(RealEstate$House_age,main = "Box plot of age of houses", ylab="age")
grid()

Distribution of the distance to the nearest MRT station

The plots below show the distribution of the distance to the nearest MRT station and allow a check for outliers.

Most properties were situated close to an MRT station.
A few outliers were observed. We chose to retain the outliers to preserve the data.

RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main="Distribution of the distance to the nearest MRT", col= "green", breaks = 20)
grid()

boxplot(RealEstate$Distance_to_the_nearest_MRT_station ,main = "Box plot of distance to the nearest MRT", ylab="Distance to the nearest MRT")
grid()
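
As a sketch of how a box plot flags outliers, the example below applies base R's `boxplot.stats()`, which marks points lying more than 1.5 times the hinge spread beyond the hinges. The vector `distance_toy` holds made-up values loosely inspired by the range of the distance feature; it is not the real data.

```r
# Made-up distances in meters (an assumption for illustration only)
distance_toy <- c(23, 90, 290, 490, 1450, 6488)

# boxplot.stats() returns the five-number summary and the flagged points
bs <- boxplot.stats(distance_toy)
flagged <- bs$out

# Upper fence: upper hinge + 1.5 * (upper hinge - lower hinge)
upper_fence <- bs$stats[4] + 1.5 * (bs$stats[4] - bs$stats[2])
```

With the real data, the same call on `RealEstate$Distance_to_the_nearest_MRT_station` would list the points drawn as outliers in the box plot.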

Distribution of the number of convenience stores

The plots below show the distribution of the number of convenience stores close to the houses and allow a check for outliers.

RealEstate$Number_of_convenience_stores %>% hist(main="Distribution of the number of convenience stores", col= "red", breaks = 10)
grid()

boxplot(RealEstate$Number_of_convenience_stores,main = "Box plot of number of convenience stores", ylab="No. of convenience stores")
grid()

Data exploration Summary

Based on the visualizations, we draw the following insights:

Most houses were priced between 40 and 45
The maximum house age in the data set is 43.8 years, and houses between 10 and 20 years old occur most often.
Most properties were situated close to an MRT station
Very few properties have a convenience store close by.
A few outliers were observed for the “distance to the nearest MRT station” feature; they were not removed, as these values might be significant in determining the house price
The data set doesn’t contain any missing values

Based on the summary statistics of the dependent variable House_price_of_unit_area, the minimum house price is 7.6 and the maximum is 78.3, with a mean of 37.79.

RealEstate %>% summary() -> table_all
knitr::kable(table_all)

No	House_age	Distance_to_the_nearest_MRT_station	Number_of_convenience_stores	Latitude	Longitude	House_price_of_unit_area
Min. : 1	Min. : 0.00	Min. : 23.38	Min. : 0.000	Min. :24.93	Min. :121.5	Min. : 7.60
1st Qu.:104	1st Qu.: 9.00	1st Qu.: 289.32	1st Qu.: 1.000	1st Qu.:24.96	1st Qu.:121.5	1st Qu.:27.70
Median :207	Median :16.10	Median : 492.23	Median : 4.000	Median :24.97	Median :121.5	Median :38.40
Mean :207	Mean :17.73	Mean :1085.90	Mean : 4.102	Mean :24.97	Mean :121.5	Mean :37.79
3rd Qu.:310	3rd Qu.:28.20	3rd Qu.:1455.80	3rd Qu.: 6.000	3rd Qu.:24.98	3rd Qu.:121.5	3rd Qu.:46.60
Max. :413	Max. :43.80	Max. :6488.02	Max. :10.000	Max. :25.01	Max. :121.6	Max. :78.30

RealEstate %>% summarise(
  Min = min(House_price_of_unit_area, na.rm = TRUE),
  Q1 = quantile(House_price_of_unit_area, probs = .25, na.rm = TRUE),
  Median = median(House_price_of_unit_area, na.rm = TRUE),
  Q3 = quantile(House_price_of_unit_area, probs = .75, na.rm = TRUE),
  Max = max(House_price_of_unit_area, na.rm = TRUE),
  Mean = mean(House_price_of_unit_area, na.rm = TRUE),
  SD = sd(House_price_of_unit_area, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(House_price_of_unit_area))
) -> table1

knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
7.6	27.7	38.4	46.6	78.3	37.78765	13.0461	413	0

Methodology

To find out which factor drives the price of a house, pairwise scatter plots are drawn of house price against age, distance to the nearest MRT station and number of convenience stores.
Latitude and longitude are discarded in this comparison, as all the properties are in the same city in Taiwan and these features offer no basis for comparison.
If the pairwise analysis shows a noticeable correlation, that relationship is modelled using a linear regression model.
To validate a linear regression we need to check four assumptions: independence, linearity, normality of residuals and homoscedasticity.
If a feature satisfies all the assumptions, it is treated as a predictor variable (x) for the dependent variable ‘house price’.
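
The pairwise screening step can be sketched with `cor()`. The data frame `toy` below is simulated purely for illustration (its coefficients are arbitrary assumptions, not estimates from the real data); with the real data one would pass the corresponding columns of `RealEstate` instead.

```r
set.seed(42)
n <- 100

# Simulated stand-in for the three candidate predictors (assumption, not real data)
toy <- data.frame(
  House_age = runif(n, 0, 40),
  Distance_to_the_nearest_MRT_station = runif(n, 25, 6500),
  Number_of_convenience_stores = sample(0:10, n, replace = TRUE)
)

# Arbitrary toy relationship plus noise, only so that the example runs
toy$House_price_of_unit_area <- 45 -
  0.2 * toy$House_age -
  0.004 * toy$Distance_to_the_nearest_MRT_station +
  1.5 * toy$Number_of_convenience_stores +
  rnorm(n, sd = 5)

predictors <- c("House_age",
                "Distance_to_the_nearest_MRT_station",
                "Number_of_convenience_stores")

# Pearson correlation of each candidate predictor with price
price_cor <- sapply(predictors, function(p) {
  cor(toy[[p]], toy$House_price_of_unit_area)
})
round(price_cor, 2)
```

A correlation close to zero suggests that a simple linear model with that predictor is unlikely to explain much of the variability in price.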

Fitting linear regression models and Hypothesis testing

Comparison 1 : Price with age of the house

Hypothesis:

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model, i.e. the slope is zero (\(\beta_1 = 0\)) ; Alternative Hypothesis (\(H_1\)): The data fit the linear regression model, i.e. the slope is non-zero (\(\beta_1 \neq 0\))

plot (House_price_of_unit_area ~ House_age, data = RealEstate, main="Comparison of price with age of the house")

Comparison 1 continued

# Fitting a Linear regression model
PriceOnAge <- lm(House_price_of_unit_area ~ House_age, data = RealEstate)
PriceOnAge %>% summary()

## 
## Call:
## lm(formula = House_price_of_unit_area ~ House_age, data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.895 -10.502   1.837   8.227  45.213 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.06794    1.16255  36.186  < 2e-16 ***
## House_age   -0.24142    0.05517  -4.376 1.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.77 on 411 degrees of freedom
## Multiple R-squared:  0.04451,    Adjusted R-squared:  0.04219 
## F-statistic: 19.15 on 1 and 411 DF,  p-value: 1.536e-05

PriceOnAge %>% confint()

##                  2.5 %     97.5 %
## (Intercept) 39.7826466 44.3532305
## House_age   -0.3498796 -0.1329695
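
As a worked check, the confidence interval for the slope can be reproduced by hand from the printed estimate and standard error, using estimate ± t × standard error with 411 residual degrees of freedom. The inputs are the rounded values from the output above, so the result matches `confint()` only up to rounding.

```r
# Values copied from the summary() output above
estimate <- -0.24142
std_error <- 0.05517
df_resid <- 411

# Critical value of the t distribution for a 95% two-sided interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 4)
```

Because the interval does not contain zero, it agrees with the small p-value reported for the slope.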

Comparison 1 continued

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnAge %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")
grid()

# 2. Normality of residuals by Q-Q plot
PriceOnAge %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
grid()

Comparison 1 : Summary

The above regression model cannot be considered a well-fitting model due to its low R² value.
Only 4.2% of the variability in house price (adjusted R² = 0.042) can be explained by a linear relationship with age.
However, the overall regression model and its coefficients are still statistically significant according to the p-values and the confidence intervals.
The model lends slight support to the common belief that newer houses are more expensive than older ones.
The Residuals vs Fitted plot shows no pattern in variability, so homoscedasticity appears reasonable. The trend line is also almost flat, so there is no sign of non-linearity.
According to the Q-Q plot, the residuals fall close to the line. The absence of obvious departures suggests that the residuals are approximately normal.
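
Two of the figures in the model output can be cross-checked with simple arithmetic: the adjusted R² follows from the multiple R², the sample size and the number of predictors, and for a model with a single predictor the F-statistic equals the square of the slope's t value. The sketch below uses the rounded numbers printed in the summary for Comparison 1.

```r
# Values copied from the summary() output of the price-on-age model
r_squared <- 0.04451
n_obs <- 413          # number of observations
n_pred <- 1           # one predictor (House_age)
t_slope <- -4.376

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

# With one predictor, F = t^2
f_from_t <- t_slope^2

round(c(adj_r_squared = adj_r_squared, f_from_t = f_from_t), 4)
```

Both values agree with the reported adjusted R² of 0.04219 and F-statistic of 19.15, up to rounding.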

Comparison 2 : Price with distance to the nearest MRT station

Hypothesis:

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model, i.e. the slope is zero (\(\beta_1 = 0\)) ; Alternative Hypothesis (\(H_1\)): The data fit the linear regression model, i.e. the slope is non-zero (\(\beta_1 \neq 0\))

plot (House_price_of_unit_area ~ Distance_to_the_nearest_MRT_station, data = RealEstate, main="Comparison of price with distance to the nearest MRT station")

The scatter plot of house prices vs distance to the nearest MRT station shows an exponential relationship. Hence both attributes were log-transformed before constructing the linear regression model.

Comparison 2 continued

The scatter plot of the log-transformed house price against the log-transformed distance to the nearest MRT station is approximately linear and shows a negative relationship between the two variables.

par(mfrow=c(2,2))
RealEstate$House_price_of_unit_area %>% hist(main = "Price")
log(RealEstate$House_price_of_unit_area) %>% hist(main = "log(Price)")
RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main = "Distance to MRT")
log(RealEstate$Distance_to_the_nearest_MRT_station) %>% hist(main = "log(Distance to MRT)")

plot (log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)
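
One way to see why logging both variables can straighten a curved relationship: if price follows a power law in distance, price = a × distance^b, then log(price) = log(a) + b × log(distance), which is linear in the logs. The sketch below simulates such data and recovers the exponent with `lm()`. The power-law form and the constants `a_true` and `b_true` are assumptions for illustration only, not estimates from the real data.

```r
set.seed(123)
n <- 200

# Assumed power-law parameters (illustrative only)
a_true <- 190
b_true <- -0.27

# Simulated distances and prices with multiplicative noise
distance <- runif(n, 20, 6500)
price <- a_true * distance^b_true * exp(rnorm(n, sd = 0.1))

# On the log-log scale the relationship is linear, so lm() can estimate b
fit_loglog <- lm(log(price) ~ log(distance))
b_hat <- unname(coef(fit_loglog)[2])
b_hat
```

The estimated slope on the log-log scale is close to the assumed exponent.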

Comparison 2 continued

# Fitting a Linear regression model
PriceOnDistance <- lm(log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)
PriceOnDistance %>% summary()

## 
## Call:
## lm(formula = log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), 
##     data = RealEstate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64679 -0.12144  0.01408  0.14479  0.73385 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               5.25862    0.07158   73.47
## log(Distance_to_the_nearest_MRT_station) -0.26507    0.01103  -24.04
##                                          Pr(>|t|)    
## (Intercept)                                <2e-16 ***
## log(Distance_to_the_nearest_MRT_station)   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2508 on 411 degrees of freedom
## Multiple R-squared:  0.5843, Adjusted R-squared:  0.5833 
## F-statistic: 577.8 on 1 and 411 DF,  p-value: < 2.2e-16

PriceOnDistance %>% confint()

##                                              2.5 %     97.5 %
## (Intercept)                               5.117916  5.3993297
## log(Distance_to_the_nearest_MRT_station) -0.286752 -0.2433974

Comparison 2 continued

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnDistance %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")

# 2. Normality of residuals by Q-Q plot
PriceOnDistance %>% plot(which = 2, main="Normality of residuals by Q-Q plot")

Comparison 2 : Summary

From the above linear regression model, it can be concluded that 58% of the variability in log price can be explained by a linear relationship with the log distance to the nearest MRT station.

The p-value for the F-statistic is very low, hence we can conclude that the model is statistically significant.
According to the p-values and the confidence intervals of the model coefficients, the intercept and the slope are also statistically significant.
Furthermore, as the log of the distance to the nearest MRT station decreases by 1 unit, the log of the price increases on average by 0.265.
The Residuals vs Fitted plot shows no pattern in variability, so homoscedasticity appears reasonable. The trend line is also almost flat, so there is no sign of non-linearity.
According to the Q-Q plot, the residuals fall close to the line. The absence of obvious departures suggests that the residuals are approximately normal.
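
To read the log-log coefficients on the original scale, the sketch below back-transforms the fitted line of Comparison 2. It uses the rounded intercept and slope from the output above; the example distance of 500 meters is an arbitrary choice. Note that exponentiating a prediction made on the log scale gives a typical (median-like) price rather than the mean.

```r
# Coefficients copied from the summary() output of the log-log model
intercept <- 5.25862
slope <- -0.26507

# Multiplicative change in price when the distance doubles: 2^slope
factor_doubling <- 2^slope

# Typical price (10000 NTD/Ping) for a house 500 meters from an MRT station
price_at_500m <- exp(intercept + slope * log(500))

round(c(factor_doubling = factor_doubling, price_at_500m = price_at_500m), 3)
```

Under this model, doubling the distance to the nearest MRT station is associated with a price that is roughly 17% lower, and a house 500 meters from a station has a typical price of about 37.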

Comparison 3 : Price with number of convenience stores

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model, i.e. the slope is zero (\(\beta_1 = 0\)) ; Alternative Hypothesis (\(H_1\)): The data fit the linear regression model, i.e. the slope is non-zero (\(\beta_1 \neq 0\))

plot (House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate, main="Comparison of price with number of convenience stores")

# Fitting a Linear regression model
PriceOnNoOfStores <- lm(House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate)

Comparison 3: continued

PriceOnNoOfStores %>% summary()

## 
## Call:
## lm(formula = House_price_of_unit_area ~ Number_of_convenience_stores, 
##     data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.339  -7.098  -1.398   6.002  30.661 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   26.6567     0.8717   30.58   <2e-16 ***
## Number_of_convenience_stores   2.7138     0.1727   15.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.32 on 411 degrees of freedom
## Multiple R-squared:  0.3753, Adjusted R-squared:  0.3738 
## F-statistic: 246.9 on 1 and 411 DF,  p-value: < 2.2e-16

PriceOnNoOfStores %>% confint()

##                                  2.5 %    97.5 %
## (Intercept)                  24.943184 28.370141
## Number_of_convenience_stores  2.374281  3.053226

Comparison 3: continued..

The Residuals vs. Fitted plot indicates that this model violates the homoscedasticity assumption. Thus this model should not be considered, even though it is statistically significant.

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnNoOfStores %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")

# 2. Normality of residuals by Q-Q plot
PriceOnNoOfStores %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
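
The presentation judges homoscedasticity visually. As a possible complement, and not something the original analysis did, the sketch below hand-rolls a studentized Breusch-Pagan style test in base R: the squared residuals are regressed on the predictor, and n × R² of that auxiliary regression is compared with a chi-squared distribution. The data are simulated so that the error spread grows with the number of stores; all constants are assumptions.

```r
set.seed(1)
n <- 400

# Simulated data whose error standard deviation grows with the predictor
stores <- sample(0:10, n, replace = TRUE)
price <- 27 + 2.7 * stores + rnorm(n, sd = 3 + stores)

fit <- lm(price ~ stores)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ stores)

# Test statistic n * R^2, chi-squared with 1 degree of freedom under H0
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
p_value
```

A small p-value points to heteroscedasticity. With the real model, `residuals(PriceOnNoOfStores)` and `RealEstate$Number_of_convenience_stores` would take the place of the simulated values.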

Discussion

Based on linear regression, it can be concluded that

Only 4.2% of the variability in house price can be explained by a linear relationship with age.
58% of the variability in log price can be explained by a linear relationship with the log distance to the nearest MRT station.
A successful simple linear regression model could not be built to explain house price from the number of convenience stores surrounding the house, due to the presence of heteroscedasticity.

Hence, to buy a profitable house in Sindian District, New Taipei City, Taiwan, the house should be situated close to an MRT station.

References

[1] Data Source: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set
[2] Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.

ANALYSIS ON REAL ESTATE VALUATION
