This data set is a collection of information on real estate properties in southeastern New Taipei City, Taiwan, specifically the Xindian District. It includes four major variables for each property: house age, distance from the metro rail, number of nearby stores, and, arguably most importantly, price per unit area. The data were collected from the summer of 2012 to the summer of 2013, or approximately one year. The ages of the houses range from new to 43.8 years old, and the number of nearby stores ranges from zero to ten.
The five-number summary for the prices within the data set is as follows:
Minimum = 7.6
1st Quartile = 27.7
Median = 38.45
3rd Quartile = 46.6
Maximum = 117.5
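A summary like the one above can be reproduced in base R. The sketch below assumes a data frame named `RE` with a `price_per_unit_area` column, matching the names in the regression output; the five values used here are stand-ins chosen for illustration, not the actual data.

```r
# Hypothetical stand-in for the real data; only the column name matches the report.
RE <- data.frame(price_per_unit_area = c(7.6, 27.7, 38.45, 46.6, 117.5))

# quantile() at these probabilities returns the minimum, quartiles, and maximum.
five_num <- quantile(RE$price_per_unit_area, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five_num) <- c("Minimum", "1st Quartile", "Median", "3rd Quartile", "Maximum")
print(five_num)
```

Calling `summary()` on the same column reports these quantities along with the mean.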
Each of the following plots shows a visible relationship between the given variable and the price per unit area. For example, Plot 1 shows a strong negative correlation between the distance from the metro and the price. A plausible explanation is that properties farther from the metro tend to be in more rural, and consequently cheaper, areas. The same reasoning applies to the positive correlation between the number of nearby stores and price. The most interesting pattern is the near-parabolic relationship between the age of the house and its price. One could conjecture that the dip in the middle occurs because buyers have a clear preference for either new or old houses and less so for those in between. It could also be that the oldest houses are old enough to have been renovated and are therefore closer to new.
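One way to check a curved age relationship numerically is to add a squared term to a simple regression and see whether the fit improves. The following is a minimal sketch on simulated data: the variable names, the simulated U-shape, and the noise level are all assumptions for illustration and are not taken from the data set.

```r
set.seed(1)
# Simulated data with a built-in U-shape; this is not the real data set.
house_age <- runif(200, min = 0, max = 43.8)
price <- 45 - 1.2 * house_age + 0.025 * house_age^2 + rnorm(200, sd = 3)

# Straight-line fit versus a fit that includes a squared age term.
fit_linear <- lm(price ~ house_age)
fit_quad   <- lm(price ~ house_age + I(house_age^2))

# A positive coefficient on the squared term indicates a U-shaped curve.
print(coef(fit_quad))

# Compare adjusted R-squared between the two fits.
print(summary(fit_linear)$adj.r.squared)
print(summary(fit_quad)$adj.r.squared)
```

If the squared term is significant and the adjusted R-squared rises on the real data, that would support describing the age relationship as roughly parabolic.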
Plot 1:
Plot 2:
Plot 3:
The same results are shown here again with even greater distinction. Looking at the extremes of the number of stores vs. price, there is a clear increase in the value of a house when more stores are nearby. This suggests more public resources and therefore more value for the property.
Plot 4:
The plots and regressions below show one of the most beneficial ways to use a data set. By creating a multiple linear regression model based on the strongest correlations in the plot below, we obtain a predictor for price based on the given inputs. We can judge the quality of the model through the regression summaries. For example, the very low p-values indicate that we can reject the null hypothesis that each coefficient is zero, so each variable has a statistically significant relationship with price. The multiple R-squared of 0.5411 means the full model explains roughly 54% of the variation in price, so it is a useful but far from perfect predictor. The strength of each pairwise relationship is also shown in the correlation plot by the steepness/color of any variable pairing. One thing to pay attention to is the presence of collinearity within a data set, as it can make the coefficient estimates unstable. In this case, we do have some slight collinearity between the distance from the metro and the number of stores. This is likely because, as stated earlier, both are linked to being in a larger metropolitan area. To experiment with removing variables, the bottom two regressions refit the model with one predictor dropped: the second omits house age and the third omits the number of stores. In this case, dropping a variable is not beneficial, as it lowers the adjusted R-squared from 0.5377 to 0.4941 and 0.4887, respectively, and therefore reduces the explanatory power of the model.
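Collinearity of the kind described above can be quantified rather than read off a plot. The sketch below uses simulated predictors (all values and the helper name `vif_by_hand` are hypothetical) to show a pairwise correlation and a variance inflation factor computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```r
set.seed(42)
# Simulated predictors; distance and stores are built to be negatively related.
n <- 300
distance_MRT <- rexp(n, rate = 1 / 1000)
stores <- pmax(0, round(8 - distance_MRT / 400 + rnorm(n)))
house_age <- runif(n, 0, 43.8)

# Pairwise correlation between the two suspect predictors.
pair_cor <- cor(distance_MRT, stores)
print(pair_cor)

# Variance inflation factor for one predictor: 1 / (1 - R^2), where R^2
# comes from regressing that predictor on the remaining predictors.
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_distance <- vif_by_hand(distance_MRT, data.frame(stores, house_age))
print(vif_distance)
```

A value near 1 means little shared information with the other predictors. A commonly cited rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though that threshold is a convention, not a strict test.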
Plot 5:
Summaries/Regressions:
##
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$house_age + RE$distance_MRT +
## RE$stores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.304 -5.430 -1.738 4.325 77.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.977286 1.384542 31.041 < 2e-16 ***
## RE$house_age -0.252856 0.040105 -6.305 7.47e-10 ***
## RE$distance_MRT -0.005379 0.000453 -11.874 < 2e-16 ***
## RE$stores 1.297443 0.194290 6.678 7.91e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.251 on 410 degrees of freedom
## Multiple R-squared: 0.5411, Adjusted R-squared: 0.5377
## F-statistic: 161.1 on 3 and 410 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$distance_MRT + RE$stores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.515 -5.862 -1.358 4.782 78.588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.1229027 1.2995071 30.106 < 2e-16 ***
## RE$distance_MRT -0.0055780 0.0004728 -11.799 < 2e-16 ***
## RE$stores 1.1975990 0.2025665 5.912 7.11e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.678 on 411 degrees of freedom
## Multiple R-squared: 0.4966, Adjusted R-squared: 0.4941
## F-statistic: 202.7 on 2 and 411 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$distance_MRT + RE$house_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.032 -4.742 -1.037 4.533 71.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.8855858 0.9677644 51.547 < 2e-16 ***
## RE$distance_MRT -0.0072086 0.0003795 -18.997 < 2e-16 ***
## RE$house_age -0.2310266 0.0420383 -5.496 6.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.73 on 411 degrees of freedom
## Multiple R-squared: 0.4911, Adjusted R-squared: 0.4887
## F-statistic: 198.3 on 2 and 411 DF, p-value: < 2.2e-16
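
To make the first (three-predictor) model concrete, its coefficients can be plugged into the regression equation by hand. The sketch below copies the estimates from the summary above; the example property (10 years old, 500 distance units from the metro, 5 nearby stores) and the function name `predict_price` are hypothetical.

```r
# Coefficients copied from the first (three-predictor) regression summary.
coefs <- c(intercept = 42.977286, house_age = -0.252856,
           distance_MRT = -0.005379, stores = 1.297443)

# Linear prediction: intercept plus each coefficient times its input.
predict_price <- function(house_age, distance_MRT, stores) {
  unname(coefs["intercept"] +
    coefs["house_age"] * house_age +
    coefs["distance_MRT"] * distance_MRT +
    coefs["stores"] * stores)
}

# Hypothetical property: 10 years old, 500 distance units away, 5 stores nearby.
predicted <- predict_price(house_age = 10, distance_MRT = 500, stores = 5)
print(predicted)  # about 44.25 in price-per-unit-area units
```

Given the residual standard error of 9.251 reported for this model, an individual prediction like this one can easily be off by that much in either direction.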