ANALYSIS ON REAL ESTATE VALUATION

Bodiyabaduge Dewsri Lalithi Perera (S3762890) and Divya Ulaganathan (S3759465)

Last updated: 02 June, 2019

Introduction

Problem Statement

To identify the deciding factor for a profitable investment in residential property, we examined the relationship between the price of a house and other factors using linear regression.

Data

To guide our investigation, we used the “Real estate valuation” data set from the UCI Machine Learning Repository. The data set consists of real estate valuations collected from Sindian District, New Taipei City, Taiwan, during 2012 and 2013.

The available features in the data set are

During data preparation, we

Visualization and Descriptive Statistics

Price distribution

The plots below show the distribution of house prices and help check for outliers. According to the histogram, most houses were priced between 35 and 45 per unit area.

RealEstate$House_price_of_unit_area %>% hist(main="Price distribution of houses", col= "blue", breaks = 20)
grid()

boxplot(RealEstate$House_price_of_unit_area,main = "Box plot of House prices", ylab="Prices")
grid()

House age distribution

The plots below show the distribution of house ages and help check for outliers. The maximum house age in the data set is 43.8, and the largest group of houses is aged between 10 and 20.

RealEstate$House_age %>% hist(main="Age distribution of houses", col= "yellow", breaks = 20)
grid()

boxplot(RealEstate$House_age,main = "Box plot of age of houses", ylab="age")
grid()

Distribution of the distance to the nearest MRT station

The plots below show the distribution of the distance to the nearest MRT station and help check for outliers.

RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main="Distribution of the distance to the nearest MRT", col= "green", breaks = 20)
grid()

boxplot(RealEstate$Distance_to_the_nearest_MRT_station ,main = "Box plot of distance to the nearest MRT", ylab="Distance to the nearest MRT")
grid()

Distribution of the number of convenience stores

The plots below show the distribution of the number of convenience stores near each house and help check for outliers.

RealEstate$Number_of_convenience_stores %>% hist(main="Distribution of the number of convenience stores", col= "red", breaks = 10)
grid()

boxplot(RealEstate$Number_of_convenience_stores, main = "Box plot of number of convenience stores", ylab="No. of convenience stores")
grid()

Data exploration Summary

Based on the visualizations, we draw the insights below.

Based on the summary statistics of the dependent variable House_price_of_unit_area, the minimum house price is 7.6 and the maximum is 78.3, with a mean of 37.79.

RealEstate %>% summary() -> table_all
knitr::kable(table_all)
| Statistic | No | House_age | Distance_to_the_nearest_MRT_station | Number_of_convenience_stores | Latitude | Longitude | House_price_of_unit_area |
|---|---|---|---|---|---|---|---|
| Min. | 1 | 0.00 | 23.38 | 0.000 | 24.93 | 121.5 | 7.60 |
| 1st Qu. | 104 | 9.00 | 289.32 | 1.000 | 24.96 | 121.5 | 27.70 |
| Median | 207 | 16.10 | 492.23 | 4.000 | 24.97 | 121.5 | 38.40 |
| Mean | 207 | 17.73 | 1085.90 | 4.102 | 24.97 | 121.5 | 37.79 |
| 3rd Qu. | 310 | 28.20 | 1455.80 | 6.000 | 24.98 | 121.5 | 46.60 |
| Max. | 413 | 43.80 | 6488.02 | 10.000 | 25.01 | 121.6 | 78.30 |

RealEstate %>%
  summarise(
    Min = min(House_price_of_unit_area, na.rm = TRUE),
    Q1 = quantile(House_price_of_unit_area, na.rm = TRUE, probs = .25),
    Median = median(House_price_of_unit_area, na.rm = TRUE),
    Q3 = quantile(House_price_of_unit_area, na.rm = TRUE, probs = .75),
    Max = max(House_price_of_unit_area, na.rm = TRUE),
    Mean = mean(House_price_of_unit_area, na.rm = TRUE),
    SD = sd(House_price_of_unit_area, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(House_price_of_unit_area))
  ) -> table1

knitr::kable(table1)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 7.6 | 27.7 | 38.4 | 46.6 | 78.3 | 37.78765 | 13.0461 | 413 | 0 |
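
The outlier check in the box plots can be reproduced by hand from the quartiles reported above. The sketch below applies the usual 1.5 × IQR rule to the price summary; the variable names are illustrative, and `boxplot()` itself works from hinges, which in general can differ slightly from the quartiles returned by `quantile()`.

```r
# Quartiles and extremes of House_price_of_unit_area, copied from the table above
q1 <- 27.7
q3 <- 46.6
min_price <- 7.6
max_price <- 78.3

iqr <- q3 - q1                    # interquartile range
lower_fence <- q1 - 1.5 * iqr     # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr     # values above this would be flagged

print(c(lower = lower_fence, upper = upper_fence))

# TRUE means at least one observation lies beyond that fence
print(c(below_lower = min_price < lower_fence,
        above_upper = max_price > upper_fence))
```

Under this rule the maximum price of 78.3 lies above the upper fence, while the minimum of 7.6 stays inside the lower fence.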

Methodology

Fitting linear regression models and Hypothesis testing

Comparison 1 : Price with age of the house

Hypothesis:

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model (the slope is zero); Alternative Hypothesis (\(H_1\)): The data fit the linear regression model (the slope is not zero)

plot (House_price_of_unit_area ~ House_age, data = RealEstate, main="Comparison of price with age of the house")

Comparison 1 continued

# Fitting a Linear regression model
PriceOnAge <- lm(House_price_of_unit_area ~ House_age, data = RealEstate)
PriceOnAge %>% summary()
## 
## Call:
## lm(formula = House_price_of_unit_area ~ House_age, data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.895 -10.502   1.837   8.227  45.213 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.06794    1.16255  36.186  < 2e-16 ***
## House_age   -0.24142    0.05517  -4.376 1.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.77 on 411 degrees of freedom
## Multiple R-squared:  0.04451,    Adjusted R-squared:  0.04219 
## F-statistic: 19.15 on 1 and 411 DF,  p-value: 1.536e-05
PriceOnAge %>% confint()
##                  2.5 %     97.5 %
## (Intercept) 39.7826466 44.3532305
## House_age   -0.3498796 -0.1329695
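
As a worked check, the t value, p-value and 95% confidence interval for the House_age slope can be rebuilt from the estimate and standard error printed above. This is only a sketch using the rounded figures from the output, so the last digits may differ slightly from `confint()`.

```r
# Values copied from the summary(PriceOnAge) output above
estimate  <- -0.24142   # slope for House_age
std_error <- 0.05517
df_resid  <- 411        # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t = t_value, lower = ci[1], upper = ci[2]), 4))
print(signif(p_value, 3))
```

Because the whole interval lies below zero, the slope is negative at the 5% level, which agrees with the small p-value in the output.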

Comparison 1 continued

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnAge %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")
grid()

# 2. Normality of residuals by Q-Q plot
PriceOnAge %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
grid()

Comparison 1 : Summary

Comparison 2 : Price with distance to the nearest MRT station

Hypothesis:

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model (the slope is zero); Alternative Hypothesis (\(H_1\)): The data fit the linear regression model (the slope is not zero)

plot (House_price_of_unit_area ~ Distance_to_the_nearest_MRT_station, data = RealEstate, main="Comparison of price with distance to the nearest MRT station")

The scatter plot of house prices vs. distance to the nearest MRT station shows an exponential relationship. Hence both attributes were log-transformed before constructing the linear regression model.
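
To illustrate why taking logs of both variables can straighten a curved relationship, here is a minimal sketch on simulated data (not the real estate data). It assumes a power-law curve with multiplicative noise; the constants 150 and -0.3 and the names `x` and `y` are hypothetical choices for illustration only.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated (not real) data: y follows a power law in x with multiplicative noise
n <- 200
x <- runif(n, min = 20, max = 6000)
true_slope <- -0.3                                  # hypothetical exponent
y <- 150 * x^true_slope * exp(rnorm(n, sd = 0.1))

raw_fit <- lm(y ~ x)             # straight line on the original scale
log_fit <- lm(log(y) ~ log(x))   # straight line on the log-log scale

raw_r2 <- summary(raw_fit)$r.squared
log_r2 <- summary(log_fit)$r.squared

print(c(raw_r2 = raw_r2, log_r2 = log_r2))
print(coef(log_fit))             # slope should be close to true_slope
```

On data of this shape the log-log fit explains more of the variability than the raw fit, and its slope recovers the exponent of the curve.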

Comparison 2 continued

The scatter plot of log house price against log distance to the nearest MRT station is approximately linear and shows a negative relationship between the two variables.

par(mfrow=c(2,2))
RealEstate$House_price_of_unit_area %>% hist(main = "Price")
log(RealEstate$House_price_of_unit_area) %>% hist(main = "log(Price)")
RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main = "Distance to MRT")
log(RealEstate$Distance_to_the_nearest_MRT_station) %>% hist(main = "log(Distance to MRT)")

plot (log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)

Comparison 2 continued

# Fitting a Linear regression model
PriceOnDistance <- lm(log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)
PriceOnDistance %>% summary()
## 
## Call:
## lm(formula = log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), 
##     data = RealEstate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64679 -0.12144  0.01408  0.14479  0.73385 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               5.25862    0.07158   73.47
## log(Distance_to_the_nearest_MRT_station) -0.26507    0.01103  -24.04
##                                          Pr(>|t|)    
## (Intercept)                                <2e-16 ***
## log(Distance_to_the_nearest_MRT_station)   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2508 on 411 degrees of freedom
## Multiple R-squared:  0.5843, Adjusted R-squared:  0.5833 
## F-statistic: 577.8 on 1 and 411 DF,  p-value: < 2.2e-16
PriceOnDistance %>% confint()
##                                              2.5 %     97.5 %
## (Intercept)                               5.117916  5.3993297
## log(Distance_to_the_nearest_MRT_station) -0.286752 -0.2433974

Comparison 2 continued

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnDistance %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")

# 2. Normality of residuals by Q-Q plot
PriceOnDistance %>% plot(which = 2, main="Normality of residuals by Q-Q plot")

Comparison 2 : Summary

By the above linear regression model, it can be concluded that about 58% of the variability in log price can be explained by a linear relationship with log distance to the nearest MRT station (R-squared = 0.5843).
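
Because both variables are logged, the slope reads as an elasticity: a 1% increase in distance goes with roughly a 0.27% decrease in price. The sketch below back-transforms the fitted line using the rounded coefficients from the output above. The helper `predict_price` and the example distances are illustrative assumptions, and exponentiating a log-scale prediction gives a typical (median-like) price, not a mean.

```r
# Coefficients copied from the summary(PriceOnDistance) output above
intercept <- 5.25862
slope     <- -0.26507

# Back-transform the fitted line: price = exp(intercept) * distance^slope
predict_price <- function(distance) exp(intercept + slope * log(distance))

# Multiplicative change in predicted price when the distance doubles
doubling_factor <- 2^slope

# Arbitrary illustration distances inside the observed range (23.38 to 6488.02)
distances <- c(100, 500, 2000)
print(data.frame(distance = distances,
                 predicted_price = round(predict_price(distances), 1)))
print(round(doubling_factor, 3))
```

A doubling factor of about 0.83 means that, under this model, doubling the distance to the nearest MRT station is associated with a price per unit area roughly 17% lower.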

Comparison 3 : Price with number of convenience stores

Null Hypothesis (\(H_0\)): The data do not fit the linear regression model (the slope is zero); Alternative Hypothesis (\(H_1\)): The data fit the linear regression model (the slope is not zero)

plot (House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate, main="Comparison of price with number of convenience stores")

# Fitting a Linear regression model
PriceOnNoOfStores <- lm(House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate)

Comparison 3 continued

PriceOnNoOfStores %>% summary()
## 
## Call:
## lm(formula = House_price_of_unit_area ~ Number_of_convenience_stores, 
##     data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.339  -7.098  -1.398   6.002  30.661 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   26.6567     0.8717   30.58   <2e-16 ***
## Number_of_convenience_stores   2.7138     0.1727   15.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.32 on 411 degrees of freedom
## Multiple R-squared:  0.3753, Adjusted R-squared:  0.3738 
## F-statistic: 246.9 on 1 and 411 DF,  p-value: < 2.2e-16
PriceOnNoOfStores %>% confint()
##                                  2.5 %    97.5 %
## (Intercept)                  24.943184 28.370141
## Number_of_convenience_stores  2.374281  3.053226

Comparison 3 continued

The model violates the assumption of homoscedasticity. Thus this model should not be taken into consideration even though it is statistically significant.

# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnNoOfStores %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")

# 2. Normality of residuals by Q-Q plot
PriceOnNoOfStores %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
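
The Residuals vs. Fitted plot is a visual check. A numerical complement, which is not part of the original analysis, is a Breusch-Pagan-style test in Koenker's studentized form: regress the squared residuals on the predictor and compare n × R² with a chi-squared distribution. The sketch below runs it on simulated data (not the real estate data) whose noise grows with the predictor; `stores` and `price` are hypothetical stand-ins.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated (not real) data whose noise grows with the predictor
n <- 413
stores <- sample(0:10, n, replace = TRUE)
price <- 27 + 2.7 * stores + rnorm(n, sd = 2 + 1.5 * stores)
fit <- lm(price ~ stores)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ stores)

# Lagrange multiplier statistic: n * R^2, compared with chi-squared on 1 df
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value here indicates that the residual variance changes with the predictor, which is the pattern a fan-shaped Residuals vs. Fitted plot suggests.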

Discussion

Based on linear regression, it can be concluded that

Hence, to buy a profitable house in Sindian District, New Taipei City, Taiwan, the house should be situated close to an MRT station.
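
The three fits can be lined up side by side using only the figures printed in the outputs above. For a regression with a single predictor, R² can be recovered from the F-statistic as F / (F + residual df), which gives a quick consistency check. This is a sketch with illustrative column names; the PriceOnDistance model is fitted on the log scale, so its R² is not strictly comparable with the other two.

```r
# F-statistics, residual df and R-squared copied from the three outputs above
models <- data.frame(
  model       = c("PriceOnAge", "PriceOnDistance", "PriceOnNoOfStores"),
  f_stat      = c(19.15, 577.8, 246.9),
  df_resid    = c(411, 411, 411),
  reported_r2 = c(0.04451, 0.5843, 0.3753)
)

# With one predictor: R^2 = F / (F + residual degrees of freedom)
models$r2_from_f <- models$f_stat / (models$f_stat + models$df_resid)

# Sort from the largest to the smallest share of explained variability
print(models[order(-models$r2_from_f), ])
```

The recomputed values agree with the reported R² to about three decimal places and rank distance to the nearest MRT station first.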

References