Bodiyabaduge Dewsri Lalithi Perera (S3762890) and Divya Ulaganathan (S3759465)
Last updated: 02 June, 2019
To identify the deciding factor for a profitable investment in residential property, we examined the relationship between the price of a house and other factors using linear regression.
To guide our investigation, we used the “Real estate valuation” data set from the UCI repository. The data set consists of real estate valuations collected in Sindian District, New Taipei City, Taiwan, during 2012 and 2013.
The available features in the data set are
During data preparation, we
The plots below show the distribution of house prices and help check for outliers. According to the histogram, the most common prices fall between 35 and 45.
RealEstate$House_price_of_unit_area %>% hist(main="Price distribution of houses", col= "blue", breaks = 20)
grid()
boxplot(RealEstate$House_price_of_unit_area, main = "Box plot of House prices", ylab = "Prices")
grid()
The plots below show the distribution of house age and help check for outliers. The maximum house age in the data set is 43.8, and most houses are between 10 and 20 years old.
RealEstate$House_age %>% hist(main="Age distribution of houses", col= "yellow", breaks = 20)
grid()
boxplot(RealEstate$House_age, main = "Box plot of age of houses", ylab = "age")
grid()
The plots below show the distribution of the distance to the nearest MRT station and help check for outliers.
RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main="Distribution of the distance to the nearest MRT", col= "green", breaks = 20)
grid()
boxplot(RealEstate$Distance_to_the_nearest_MRT_station, main = "Box plot of distance to the nearest MRT", ylab = "Distance to the nearest MRT")
grid()
The plots below show the number of convenience stores near each house and help check for outliers.
RealEstate$Number_of_convenience_stores %>% hist(main="Distribution of the number of convenience stores", col= "red", breaks = 10)
grid()
boxplot(RealEstate$Number_of_convenience_stores, main = "Box plot of number of convenience stores", ylab = "No. of convenience stores")
grid()
Based on the visualizations, we draw the following insights:
Based on the summary statistics of the dependent variable House_price_of_unit_area, the minimum house price is 7.6 and the maximum is 78.3, with a mean of 37.79.
RealEstate %>% summary() -> table_all
knitr::kable(table_all)
| No | House_age | Distance_to_the_nearest_MRT_station | Number_of_convenience_stores | Latitude | Longitude | House_price_of_unit_area |
|---|---|---|---|---|---|---|
| Min. : 1 | Min. : 0.00 | Min. : 23.38 | Min. : 0.000 | Min. :24.93 | Min. :121.5 | Min. : 7.60 |
| 1st Qu.:104 | 1st Qu.: 9.00 | 1st Qu.: 289.32 | 1st Qu.: 1.000 | 1st Qu.:24.96 | 1st Qu.:121.5 | 1st Qu.:27.70 |
| Median :207 | Median :16.10 | Median : 492.23 | Median : 4.000 | Median :24.97 | Median :121.5 | Median :38.40 |
| Mean :207 | Mean :17.73 | Mean :1085.90 | Mean : 4.102 | Mean :24.97 | Mean :121.5 | Mean :37.79 |
| 3rd Qu.:310 | 3rd Qu.:28.20 | 3rd Qu.:1455.80 | 3rd Qu.: 6.000 | 3rd Qu.:24.98 | 3rd Qu.:121.5 | 3rd Qu.:46.60 |
| Max. :413 | Max. :43.80 | Max. :6488.02 | Max. :10.000 | Max. :25.01 | Max. :121.6 | Max. :78.30 |
RealEstate %>%
  summarise(
    Min = min(House_price_of_unit_area, na.rm = TRUE),
    Q1 = quantile(House_price_of_unit_area, na.rm = TRUE, probs = .25),
    Median = median(House_price_of_unit_area, na.rm = TRUE),
    Q3 = quantile(House_price_of_unit_area, na.rm = TRUE, probs = .75),
    Max = max(House_price_of_unit_area, na.rm = TRUE),
    Mean = mean(House_price_of_unit_area, na.rm = TRUE),
    SD = sd(House_price_of_unit_area, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(House_price_of_unit_area))
  ) -> table1
knitr::kable(table1)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 7.6 | 27.7 | 38.4 | 46.6 | 78.3 | 37.78765 | 13.0461 | 413 | 0 |
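The box plots above flag outliers with the usual 1.5 × IQR rule. As a rough illustration, the sketch below applies that rule to the quartiles reported in the table. The variable names are introduced here for illustration only, and `boxplot()` computes its fences from hinges rather than from `quantile()`, so the fences it draws may differ slightly.

```r
# Quartiles and extremes copied from the summary table above
q1 <- 27.7
q3 <- 46.6
observed_min <- 7.6
observed_max <- 78.3

# 1.5 * IQR fences, the default rule used for box plot whiskers
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE means at least one observation lies beyond that fence
has_low_outlier  <- observed_min < lower_fence
has_high_outlier <- observed_max > upper_fence

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(c(low = has_low_outlier, high = has_high_outlier))
```

Under these assumptions the maximum price lies above the upper fence while the minimum stays inside the lower fence, so any price outliers are on the high side.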
Hypothesis:
Null hypothesis (\(H_0\)): The data do not fit the linear regression model.
Alternative hypothesis (\(H_1\)): The data fit the linear regression model.
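One conventional way to make these hypotheses precise, assumed here rather than stated in the report, is through the slope of the simple linear regression model. The symbols \(\beta_0\), \(\beta_1\), \(\varepsilon_i\) and \(\sigma^2\) are introduced for illustration; with \(n = 413\) observations the overall F-test has 1 and 411 degrees of freedom.

```latex
\[
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2)
\]
\[
  H_0 : \beta_1 = 0
  \qquad \text{versus} \qquad
  H_1 : \beta_1 \neq 0
\]
\[
  F = \frac{\mathrm{SSR} / 1}{\mathrm{SSE} / (n - 2)}
  \sim F_{1,\, n-2} \quad \text{under } H_0
\]
```

Here SSR is the regression sum of squares and SSE is the residual sum of squares. A small p-value for F leads to rejecting \(H_0\), which corresponds to concluding that the data fit the linear model.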
plot(House_price_of_unit_area ~ House_age, data = RealEstate, main="Comparison of price with age of the house")
# Fitting a linear regression model
PriceOnAge <- lm(House_price_of_unit_area ~ House_age, data = RealEstate)
PriceOnAge %>% summary()
##
## Call:
## lm(formula = House_price_of_unit_area ~ House_age, data = RealEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.895 -10.502 1.837 8.227 45.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.06794 1.16255 36.186 < 2e-16 ***
## House_age -0.24142 0.05517 -4.376 1.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.77 on 411 degrees of freedom
## Multiple R-squared: 0.04451, Adjusted R-squared: 0.04219
## F-statistic: 19.15 on 1 and 411 DF, p-value: 1.536e-05
PriceOnAge %>% confint()
## 2.5 % 97.5 %
## (Intercept) 39.7826466 44.3532305
## House_age -0.3498796 -0.1329695
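To show how the reported figures relate to each other, the sketch below rebuilds the t value, p-value and 95% confidence interval for the House_age slope from the printed estimate and standard error. It also makes a point prediction for a hypothetical 10-year-old house. All variable names are illustrative, and small differences from the printed output come from rounding of the inputs.

```r
# Values copied from the summary() output above
intercept <- 42.06794
estimate  <- -0.24142   # slope for House_age
std_error <- 0.05517
df_resid  <- 411

# t statistic and two-sided p-value for H0: slope = 0
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error

# Fitted price for a hypothetical house aged 10 years
pred_age_10 <- intercept + estimate * 10

print(c(t_value = t_value, p_value = p_value))
print(ci_95)
print(pred_age_10)
```

The slope is clearly different from zero, but the R-squared of 0.04451 in the output means house age alone explains only about 4.5% of the variability in price.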
# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnAge %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")
grid()
# 2. Normality of residuals by Q-Q plot
PriceOnAge %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
grid()
Hypothesis:
Null hypothesis (\(H_0\)): The data do not fit the linear regression model.
Alternative hypothesis (\(H_1\)): The data fit the linear regression model.
plot(House_price_of_unit_area ~ Distance_to_the_nearest_MRT_station, data = RealEstate, main="Comparison of price with distance to the nearest MRT station")
The scatter plot of house price against distance to the nearest MRT station appears to show an exponential relationship. Hence both variables were log-transformed before constructing the linear regression model.
The scatter plot of log house price against log distance to the nearest MRT station is approximately linear and shows a negative relationship between the two variables.
par(mfrow=c(2,2))
RealEstate$House_price_of_unit_area %>% hist(main = "Price")
log(RealEstate$House_price_of_unit_area) %>% hist(main = "log(Price)")
RealEstate$Distance_to_the_nearest_MRT_station %>% hist(main = "Distance to MRT")
log(RealEstate$Distance_to_the_nearest_MRT_station) %>% hist(main = "log(Distance to MRT)")
plot(log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)
# Fitting a linear regression model
PriceOnDistance <- lm(log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station), data = RealEstate)
PriceOnDistance %>% summary()
##
## Call:
## lm(formula = log(House_price_of_unit_area) ~ log(Distance_to_the_nearest_MRT_station),
## data = RealEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64679 -0.12144 0.01408 0.14479 0.73385
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 5.25862 0.07158 73.47
## log(Distance_to_the_nearest_MRT_station) -0.26507 0.01103 -24.04
## Pr(>|t|)
## (Intercept) <2e-16 ***
## log(Distance_to_the_nearest_MRT_station) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2508 on 411 degrees of freedom
## Multiple R-squared: 0.5843, Adjusted R-squared: 0.5833
## F-statistic: 577.8 on 1 and 411 DF, p-value: < 2.2e-16
PriceOnDistance %>% confint()
## 2.5 % 97.5 %
## (Intercept) 5.117916 5.3993297
## log(Distance_to_the_nearest_MRT_station) -0.286752 -0.2433974
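Because both variables are log-transformed, the slope describes a proportional change rather than an additive one. The sketch below uses the reported coefficients to back-transform a prediction and to compute the effect of doubling the distance. The helper `predict_price()` and the example distances are hypothetical, the distance is assumed to be recorded in metres, and exponentiating a log-scale prediction is only an approximation of the expected price.

```r
# Coefficients copied from the summary() output above
intercept <- 5.25862
slope     <- -0.26507

# Back-transform the log-log fit to the original price scale
# (hypothetical helper; distance_m is assumed to be in metres)
predict_price <- function(distance_m) {
  exp(intercept + slope * log(distance_m))
}

# Multiplicative change in predicted price when the distance doubles
doubling_factor <- 2^slope

price_500  <- predict_price(500)
price_1000 <- predict_price(1000)

print(c(price_500 = price_500, price_1000 = price_1000))
print(doubling_factor)
```

Under these assumptions, doubling the distance to the nearest MRT station is associated with a predicted price roughly 17% lower, wherever along the distance range the doubling occurs.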
# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnDistance %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")
# 2. Normality of residuals by Q-Q plot
PriceOnDistance %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
From the above linear regression model, it can be concluded that about 58% of the variability in log price can be explained by a linear relationship with log distance to the nearest MRT station.
Hypothesis:
Null hypothesis (\(H_0\)): The data do not fit the linear regression model.
Alternative hypothesis (\(H_1\)): The data fit the linear regression model.
plot(House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate, main="Comparison of price with number of convenience stores")
# Fitting a linear regression model
PriceOnNoOfStores <- lm(House_price_of_unit_area ~ Number_of_convenience_stores, data = RealEstate)
PriceOnNoOfStores %>% summary()
##
## Call:
## lm(formula = House_price_of_unit_area ~ Number_of_convenience_stores,
## data = RealEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.339 -7.098 -1.398 6.002 30.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6567 0.8717 30.58 <2e-16 ***
## Number_of_convenience_stores 2.7138 0.1727 15.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.32 on 411 degrees of freedom
## Multiple R-squared: 0.3753, Adjusted R-squared: 0.3738
## F-statistic: 246.9 on 1 and 411 DF, p-value: < 2.2e-16
PriceOnNoOfStores %>% confint()
## 2.5 % 97.5 %
## (Intercept) 24.943184 28.370141
## Number_of_convenience_stores 2.374281 3.053226
This model appears to violate the assumption of homoscedasticity. It should therefore not be taken into consideration, even though it is statistically significant.
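A homoscedasticity concern of this kind can also be examined numerically. The sketch below is not part of the original analysis: it uses simulated data whose error spread grows with the predictor, and it applies the studentized (Koenker) form of the Breusch-Pagan test by hand in base R. All names and simulation settings are hypothetical.

```r
# Simulated data only; these values do not come from the real estate data set
set.seed(1)
n <- 413
x <- runif(n, min = 0, max = 10)
y <- 25 + 3 * x + rnorm(n, sd = 1 + x)  # error spread increases with x

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value indicates that the residual variance changes with the predictor. Given the fitted model from this report, the same steps could be applied to `PriceOnNoOfStores`, or `lmtest::bptest()` could be used if the lmtest package is installed.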
# Validating the assumptions
# 1. Homoscedasticity by Residuals vs. Fitted plot
PriceOnNoOfStores %>% plot(which = 1, main="Homoscedasticity by Residuals vs. Fitted plot")
# 2. Normality of residuals by Q-Q plot
PriceOnNoOfStores %>% plot(which = 2, main="Normality of residuals by Q-Q plot")
Based on linear regression it can be concluded that
Hence, to buy a profitable house in Sindian District, New Taipei City, Taiwan, the house should be situated close to an MRT station.
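As a compact view of the evidence behind this conclusion, the sketch below collects the Multiple R-squared values reported in the three model summaries and ranks them. The vector names are illustrative. One caveat: the distance model was fitted on log-transformed variables, so its R-squared describes variability in log price and is not strictly comparable with the other two.

```r
# Multiple R-squared values copied from the three summary() outputs above
r_squared <- c(
  house_age           = 0.04451,
  distance_mrt_loglog = 0.5843,
  convenience_stores  = 0.3753
)

# Rank the predictors by share of variability explained
ranked <- sort(r_squared, decreasing = TRUE)
best_predictor <- names(ranked)[1]

print(round(100 * ranked, 1))  # percentages
print(best_predictor)
```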
[1] Data Source: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set
[2] Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.