Data Set Analysis

This data set collects information on real estate in southeastern New Taipei City, Taiwan, specifically the Xindian District. It records four main attributes for each property: house age, distance from the metro rail, number of nearby stores, and, arguably most importantly, price per unit area. The data were collected from the summer of 2012 to the summer of 2013, a span of approximately one year. House ages range from new to 43.8 years, and the number of nearby stores ranges from zero to ten.

 


Five Number Summary For Prices

The five-number summary for the prices within the data set is as follows:
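A minimal sketch of how such a summary could be computed in R, assuming the data frame is named RE with a price_per_unit_area column as in the regression calls later in this report. The seven prices below are made-up placeholder values, not observations from the data set.

```r
# Hypothetical stand-in for the real data; only the column name is taken
# from the regression calls in this report.
RE <- data.frame(price_per_unit_area = c(10, 20, 30, 40, 50, 60, 70))

# fivenum() returns Tukey's five numbers:
# minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(RE$price_per_unit_area)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)
```

Note that summary() reports quartiles rather than hinges, so its middle values can differ slightly from fivenum() on some sample sizes.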


Graphing Price Against The Various Variables

Each of the following plots shows a clear relationship between the given variable and the price per unit area. For example, Plot 1 shows a strong negative correlation between distance from the metro and price. A plausible explanation is that properties farther from the metro tend to be in more rural, and consequently cheaper, areas. The same reasoning applies to the positive correlation between nearby stores and price. The most interesting case is the roughly parabolic relationship between the age of a house and its price. One could conjecture that the dip in the middle arises because buyers clearly prefer either new or old houses over those in between. It could also be that the oldest houses are old enough to have been renovated, bringing them closer to new.
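The comparisons described above could be reproduced along these lines. This is only a sketch: the data frame is randomly generated stand-in data (the coefficients used to simulate it are arbitrary assumptions), and only the column names follow the regression calls in this report.

```r
set.seed(42)
n <- 200

# Hypothetical synthetic data, NOT the real observations.
RE <- data.frame(
  house_age    = runif(n, 0, 43.8),
  distance_MRT = runif(n, 20, 6500),
  stores       = sample(0:10, n, replace = TRUE)
)
RE$price_per_unit_area <- 43 - 0.25 * RE$house_age -
  0.005 * RE$distance_MRT + 1.3 * RE$stores + rnorm(n, sd = 3)

# Pearson correlation of each predictor with price.
predictors <- c("distance_MRT", "stores", "house_age")
cors <- sapply(predictors, function(v) cor(RE[[v]], RE$price_per_unit_area))
print(round(cors, 2))

# One scatter plot per predictor, side by side.
op <- par(mfrow = c(1, 3))
for (v in predictors) {
  plot(RE[[v]], RE$price_per_unit_area,
       xlab = v, ylab = "price_per_unit_area")
}
par(op)
```

Pearson correlation only measures linear association, so a U-shaped relationship such as the one described for house age can produce a weak coefficient even when the pattern in the scatter plot is obvious.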

 

Plot 1:

 

Plot 2:

 

Plot 3:

 


Another Look At Stores' Effect On Price

The same result appears here with even greater distinction. Comparing the extremes of the store count against price shows a clear increase in the value of a house when more stores are nearby. This suggests that more nearby stores mean more public resources and therefore more value for the property.
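One way to compare the extremes is to summarize price by store count. The sketch below uses a tiny made-up data frame (placeholder values, not the real observations), with column names borrowed from the regression calls in this report.

```r
# Hypothetical placeholder data with three store counts.
RE <- data.frame(
  stores              = c(0, 0, 0, 5, 5, 5, 10, 10, 10),
  price_per_unit_area = c(20, 25, 30, 35, 40, 45, 50, 55, 60)
)

# Median price for each store count.
med_by_stores <- tapply(RE$price_per_unit_area, RE$stores, median)
print(med_by_stores)

# Difference in median price between the most and fewest stores.
low  <- med_by_stores[as.character(min(RE$stores))]
high <- med_by_stores[as.character(max(RE$stores))]
gap  <- unname(high - low)
print(gap)

# Box plot of price grouped by store count.
boxplot(price_per_unit_area ~ stores, data = RE,
        xlab = "stores", ylab = "price_per_unit_area")
```

Medians are used here rather than means so that a few very expensive properties do not dominate the comparison.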

 

Plot 4:

 


Examining The Correlations Between The Data And Bringing It All Together Into A Model

The plots and regressions below show one of the most useful things to do with a data set. By building a multiple linear regression model on the strongest correlations from the plot below, we obtain a model that explains roughly half of the variation in price (R-squared of 0.5411). The regression output lets us judge the quality of the model. For example, the very low p-values indicate that we can reject the null hypothesis that each coefficient is zero, and therefore trust that the given variables are associated with price. The same idea is shown in the correlation plot by the steepness/color of each variable pairing. One thing to watch for, though, is collinearity within a data set, as it can make the model's coefficient estimates unstable. In this case, there is some slight collinearity between the distance from the metro and the number of stores. This is likely because, as stated earlier, both are linked to being in a larger metropolitan area. As an experiment, the bottom two regressions each drop one predictor: the second omits house age and the third omits stores. Neither is beneficial here, as the adjusted R-squared falls from 0.5377 for the full model to 0.4941 and 0.4887, meaning the reduced models explain less of the variation in price.
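The model comparison described above can be sketched as follows. The data frame here is simulated stand-in data in which stores is deliberately made to depend on distance, so the two predictors are correlated; all simulation coefficients are assumptions chosen for illustration. The formulas use the data = RE form, which fits the same models as the RE$... form shown in the output.

```r
set.seed(7)
n <- 200

# Hypothetical synthetic data, NOT the real observations.
distance_MRT <- runif(n, 20, 6500)
# stores depends on distance, which induces collinearity.
stores <- pmin(10, pmax(0, round(8 - distance_MRT / 800 + rnorm(n, sd = 1.5))))
house_age <- runif(n, 0, 43.8)
price <- 43 - 0.25 * house_age - 0.005 * distance_MRT +
  1.3 * stores + rnorm(n, sd = 5)
RE <- data.frame(price_per_unit_area = price, house_age, distance_MRT, stores)

# Check collinearity between the two location-related predictors.
cor_ds <- cor(RE$distance_MRT, RE$stores)
print(cor_ds)

# Full model and the two reduced models.
full      <- lm(price_per_unit_area ~ house_age + distance_MRT + stores, data = RE)
no_age    <- lm(price_per_unit_area ~ distance_MRT + stores, data = RE)
no_stores <- lm(price_per_unit_area ~ distance_MRT + house_age, data = RE)

# Compare adjusted R-squared across the three fits.
models <- list(full = full, no_age = no_age, no_stores = no_stores)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))
```

Adjusted R-squared is the fairer comparison here because plain R-squared can never decrease when a predictor is added, whether or not that predictor is useful.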

 

Plot 5:

 

Summaries/Regressions:

## 
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$house_age + RE$distance_MRT + 
##     RE$stores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.304  -5.430  -1.738   4.325  77.315 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     42.977286   1.384542  31.041  < 2e-16 ***
## RE$house_age    -0.252856   0.040105  -6.305 7.47e-10 ***
## RE$distance_MRT -0.005379   0.000453 -11.874  < 2e-16 ***
## RE$stores        1.297443   0.194290   6.678 7.91e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.251 on 410 degrees of freedom
## Multiple R-squared:  0.5411, Adjusted R-squared:  0.5377 
## F-statistic: 161.1 on 3 and 410 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$distance_MRT + RE$stores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.515  -5.862  -1.358   4.782  78.588 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     39.1229027  1.2995071  30.106  < 2e-16 ***
## RE$distance_MRT -0.0055780  0.0004728 -11.799  < 2e-16 ***
## RE$stores        1.1975990  0.2025665   5.912 7.11e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.678 on 411 degrees of freedom
## Multiple R-squared:  0.4966, Adjusted R-squared:  0.4941 
## F-statistic: 202.7 on 2 and 411 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = RE$price_per_unit_area ~ RE$distance_MRT + RE$house_age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.032  -4.742  -1.037   4.533  71.930 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     49.8855858  0.9677644  51.547  < 2e-16 ***
## RE$distance_MRT -0.0072086  0.0003795 -18.997  < 2e-16 ***
## RE$house_age    -0.2310266  0.0420383  -5.496 6.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.73 on 411 degrees of freedom
## Multiple R-squared:  0.4911, Adjusted R-squared:  0.4887 
## F-statistic: 198.3 on 2 and 411 DF,  p-value: < 2.2e-16
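
As a worked example, the coefficients of the first (full) regression above can be turned into a prediction. The function name and the example property (10 years old, distance of 500, 5 nearby stores) are hypothetical and chosen only for illustration; the coefficients are copied from the output.

```r
# Prediction from the full model's fitted coefficients (copied from the
# first regression summary above). predict_price is a hypothetical helper.
predict_price <- function(house_age, distance_MRT, stores) {
  42.977286 - 0.252856 * house_age - 0.005379 * distance_MRT +
    1.297443 * stores
}

# Hypothetical property: 10 years old, distance 500, 5 nearby stores.
# 42.977286 - 2.52856 - 2.6895 + 6.487215 = 44.246441
example_price <- predict_price(10, 500, 5)
print(example_price)
```

With a residual standard error of 9.251, an individual prediction like this can easily be off by around 9 price units in either direction.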