Real Estate Data Analysis for Price Prediction Per Unit Area

In this report we examine a real estate data set for houses located at . The goal was to find and evaluate a multiple linear regression model for the house price per unit area. We used four variables from the data set: house age (ranging from 0 to 43.8 years), distance to the Mass Rapid Transit (subway), number of stores, and price per unit area. The five-number summary (minimum, first quartile, median, third quartile, maximum) of the price per unit area is 7.60, 27.70, 38.45, 46.60, 117.50.
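As a small illustration of how a five-number summary is obtained in R, the sketch below applies `fivenum()` to a made-up vector. The vector `toy_prices` is hypothetical and is not taken from the data set; with the real data the same call would presumably be applied to `Real_estate$price_per_unit_area`.

```r
# Hypothetical prices, used only to illustrate the call
toy_prices <- c(12, 7, 30, 25, 18, 41, 36)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(toy_prices)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five
```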

model_estate <- lm(price_per_unit_area ~ house_age + stores + distance_MRT, data = Real_estate)
model_estate

## 
## Call:
## lm(formula = price_per_unit_area ~ house_age + stores + distance_MRT, 
##     data = Real_estate)
## 
## Coefficients:
##  (Intercept)     house_age        stores  distance_MRT  
##    42.977286     -0.252856      1.297442     -0.005379
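
To see what these coefficients mean in practice, the sketch below plugs one house into the fitted equation by hand. The coefficient values are copied from the output above; the example house (age 10, 5 stores, a distance_MRT of 500) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the printed model
b <- c(intercept = 42.977286, house_age = -0.252856,
       stores = 1.297442, distance_MRT = -0.005379)

# Hypothetical house: values chosen only for illustration
house <- c(house_age = 10, stores = 5, distance_MRT = 500)

# Prediction = intercept + sum of (coefficient * value)
pred <- unname(b["intercept"] + sum(b[names(house)] * house))
pred
```

With the data loaded, `predict(model_estate, newdata = data.frame(house_age = 10, stores = 5, distance_MRT = 500))` should give the same value up to rounding.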

summary(model_estate)

## 
## Call:
## lm(formula = price_per_unit_area ~ house_age + stores + distance_MRT, 
##     data = Real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.304  -5.430  -1.738   4.325  77.315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.977286   1.384542  31.041  < 2e-16 ***
## house_age    -0.252856   0.040105  -6.305 7.47e-10 ***
## stores        1.297443   0.194290   6.678 7.91e-11 ***
## distance_MRT -0.005379   0.000453 -11.874  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.251 on 410 degrees of freedom
## Multiple R-squared:  0.5411, Adjusted R-squared:  0.5377 
## F-statistic: 161.1 on 3 and 410 DF,  p-value: < 2.2e-16

The model summary above shows that distance_MRT, the distance to the subway, is statistically significant because its p-value is very small (less than 2e-16). The other two variables, house age and stores, are also statistically significant, as their p-values (7.47e-10 and 7.91e-11) are far below the usual significance level of 0.05. Each coefficient is therefore significantly different from zero when the other two predictors are held in the model, which supports keeping all three variables.
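
As a check on how these p-values arise, the sketch below recomputes the t value and two-sided p-value for house_age from its estimate and standard error. The three input numbers are copied from the summary; the variable names are only illustrative.

```r
# Estimate and standard error for house_age, copied from the summary
estimate  <- -0.252856
std_error <- 0.040105
df_resid  <- 410   # residual degrees of freedom from the summary

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

c(t_value = t_value, p_value = p_value)
```

The result should agree with the t value and Pr(>|t|) columns of the summary up to rounding.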

The adjusted R-squared is 0.5377, so the model accounts for a little over half of the variation in price per unit area. Removing any of the variables reduces the adjusted R-squared value, which shows that each variable contributes to the fit of the model.
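
The adjusted R-squared penalizes the ordinary R-squared for the number of predictors. A minimal sketch of the standard formula, using the values reported in the summary (the sample size of 414 is inferred from the 410 residual degrees of freedom plus 4 estimated coefficients):

```r
r2 <- 0.5411   # Multiple R-squared from the summary
n  <- 414      # observations: 410 residual df + 4 estimated coefficients
p  <- 3        # number of predictors

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```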

The overall p-value of the F-test is less than 2.2e-16, which is approximately zero. This supports the conclusion that changes in the response variable (price per unit area) are related to changes in the predictor variables, i.e. house age, number of stores and distance MRT.
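
The overall F-statistic can be recovered from the R-squared alone. The sketch below does this with the rounded values from the summary, so small rounding differences from the reported 161.1 are expected.

```r
r2 <- 0.5411   # Multiple R-squared from the summary
n  <- 414      # observations (410 residual df + 4 coefficients)
p  <- 3        # number of predictors

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution on (p, n - p - 1) df
p_overall <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

c(f_stat = f_stat, p_overall = p_overall)
```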

res <- resid(model_estate)
plot(fitted(model_estate), res)

In the residual plot above, the points are scattered randomly around the residual = 0 line with no clear pattern, so we can conclude that a linear model is a reasonable fit for the data.
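
One caveat when reading a residual plot: for a least-squares fit with an intercept, the residuals always average to zero and are uncorrelated with the fitted values, so being centered on the zero line is automatic. What matters is the absence of curvature or a funnel shape. The sketch below checks this property on R's built-in `mtcars` data, used here only as a stand-in because the real estate file is not loaded in this snippet.

```r
# Stand-in model on a built-in data set (not the real estate data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Both quantities are zero up to floating-point error for any
# least-squares fit that includes an intercept
mean_res    <- mean(res)
cor_res_fit <- cor(fitted(fit), res)

c(mean_res = mean_res, cor_res_fit = cor_res_fit)
```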

cormat <- round(cor(Real_estate),2) 
cormat

##                     house_age distance_MRT stores price_per_unit_area
## house_age                1.00         0.03   0.05               -0.21
## distance_MRT             0.03         1.00  -0.60               -0.67
## stores                   0.05        -0.60   1.00                0.57
## price_per_unit_area     -0.21        -0.67   0.57                1.00

The correlation matrix shows the pairwise correlation coefficients between the variables. There is a fairly strong negative correlation (-0.60) between distance MRT and number of stores. Collinearity between predictors affects the interpretability of the model, because the estimated regression coefficients become less stable and each one partly reflects the influence of the other correlated predictor. For this reason, we tried removing one of the two variables to see whether the model improves. In the model below, the number of stores was removed because it had a higher p-value than distance MRT, meaning the evidence for its effect is weaker than for distance MRT.
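
One common way to quantify collinearity is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below computes it from the rounded correlations printed above, so the values are approximate. As a rule of thumb (an assumption, not a result from this report), VIFs above roughly 5 to 10 are treated as a warning sign.

```r
# Correlations among the three predictors, copied from the matrix above
predictors <- c("house_age", "distance_MRT", "stores")
R <- matrix(c(1.00,  0.03,  0.05,
              0.03,  1.00, -0.60,
              0.05, -0.60,  1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(predictors, predictors))

# VIF of each predictor = diagonal of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 2)
```

VIFs close to 1 would suggest that the correlation between distance MRT and number of stores inflates the coefficient variances only mildly.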

model_estate <- lm(price_per_unit_area ~ distance_MRT + house_age, data = Real_estate)
model_estate

## 
## Call:
## lm(formula = price_per_unit_area ~ distance_MRT + house_age, 
##     data = Real_estate)
## 
## Coefficients:
##  (Intercept)  distance_MRT     house_age  
##    49.885586     -0.007209     -0.231027

summary(model_estate)

## 
## Call:
## lm(formula = price_per_unit_area ~ distance_MRT + house_age, 
##     data = Real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.032  -4.742  -1.037   4.533  71.930 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.8855858  0.9677644  51.547  < 2e-16 ***
## distance_MRT -0.0072086  0.0003795 -18.997  < 2e-16 ***
## house_age    -0.2310266  0.0420383  -5.496 6.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.73 on 411 degrees of freedom
## Multiple R-squared:  0.4911, Adjusted R-squared:  0.4887 
## F-statistic: 198.3 on 2 and 411 DF,  p-value: < 2.2e-16

However, the new model without number of stores is not as good as the model that included it: the adjusted R-squared fell from 0.5377 to 0.4887 and the residual standard error rose from 9.251 to 9.73. Therefore the variable number of stores should not be removed, as removing it lowers the quality of the model.
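
The comparison between the two models can also be expressed as a partial F-test. The sketch below reconstructs it from the residual standard errors and degrees of freedom printed in the two summaries; because those inputs are rounded, the result is approximate.

```r
# Residual standard errors and residual df from the two summaries
rse_full <- 9.251; df_full <- 410   # model with stores
rse_red  <- 9.73;  df_red  <- 411   # model without stores

# Residual sum of squares = RSE^2 * residual df
rss_full <- rse_full^2 * df_full
rss_red  <- rse_red^2  * df_red

# Partial F statistic for dropping the single variable stores
f_partial <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_partial <- pf(f_partial, df1 = df_red - df_full, df2 = df_full,
                lower.tail = FALSE)

c(f_partial = f_partial, p_partial = p_partial)
```

Since only one variable is dropped, this F value should be close to the square of the t value for stores in the first summary (6.678 squared is about 44.6). If the two fits were stored under different names (the report reuses `model_estate`), `anova()` on the reduced and full models would run the same test on unrounded values.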

Reference: Real_estate Data set https://www.kaggle.com/jashwanthram10/multilinear-regression/data?select=Real+estate.csv

2022-11-10