Subtasks
Signatures
Abbreviations and Definitions
Introduction
Descriptive Statistics
Methods
Analysis
Results
Conclusion
Mapping
Andrew: Final Model Building and Reporting, HTML Presentation Creation, SAP
Brian: Model/Residual Diagnostics, SAP
Julia: Descriptive Statistics Building and Writing, Report Conclusion, Analysis Plan Writing, SAP
Mikaela: Model Testing, Report Introduction, SAP
The data for this report were collected to examine how house prices in Taipei, Taiwan, are affected by six factors related to time and location. The data set contains 414 independent house observations and 8 variables, one of which is an ID variable. House price per unit area is the dependent variable (Y). The other six variables are independent variables that may affect house price: the year of the transaction (X1), the age of the house (X2), the distance from the house to the nearest MRT station (X3), the number of convenience stores nearby (X4), and the house’s latitude (X5) and longitude (X6). Transaction year is treated as a categorical variable because it takes only two values: 2012 and 2013.
The following sections summarize the exploratory work performed before modeling, the modeling methods explored, the reasons these methods were chosen, and the conclusions drawn from the models.
To examine our variables more closely, we began with descriptive statistics. First, we viewed the first six observations of the data to see at a glance what the data look like and which steps to take next. We then looked at the structure of the data, which gave the observation count (414 observations), the variable count (8 variables), and the variable types (all numerical). Next, we examined each variable’s mean, median, standard deviation, minimum and maximum values, range, and standard error. This information shows the basic properties of the data and hints at relationships we might find. We also used it to decide how to combine the longitude and latitude variables into a single variable, named Region.
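The steps described above can be sketched in base R as follows. This is only an illustration: the data frame `houses` is a synthetic stand-in (the report's data file is not reproduced here), the column names `No`, `Latitude`, and `Longitude` are assumed, and the helper `describe_var` is a hypothetical name, not the function the authors used.

```r
# Synthetic stand-in for the housing data (assumption: 414 rows, 8 numeric columns).
set.seed(1)
n <- 414
houses <- data.frame(
  No              = seq_len(n),
  TransactionYear = sample(c(2012, 2013), n, replace = TRUE),
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = sample(0:10, n, replace = TRUE),
  Latitude        = runif(n, 24.93, 25.02),
  Longitude       = runif(n, 121.47, 121.57)
)
houses$PriceUnitArea <- exp(
  3.5 - 0.008 * houses$HouseAge - 0.00013 * houses$Distance2MRT +
    0.03 * houses$NumConvenStores + rnorm(n, sd = 0.25)
)

head(houses)  # first six observations
str(houses)   # observation count, variable count, variable types

# Hypothetical helper: the summary measures listed in the text for one variable.
describe_var <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x), range = max(x) - min(x),
    se = sd(x) / sqrt(length(x)))
}

# One row per variable (the ID column is skipped), one column per measure.
desc <- t(sapply(houses[, -1], describe_var))
print(round(desc, 3))
```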
Linear regression was used as a starting point for the analysis because it is simple to understand and interpret. Because the distribution of the response variable (PriceUnitArea) is skewed to the right, we explored other regression techniques. We started with the Box-Cox transformation in an attempt to normalize the distribution of PriceUnitArea, but found it not very effective. The Box-Cox transformation is also known to make a model harder to interpret, which is why we turned to a special case of it: the log transformation, which corresponds to the Box-Cox transformation with λ = 0.
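For reference, the standard one-parameter Box-Cox family for a positive response y can be written as below. The notation y^(λ) is the usual textbook convention and is not taken from the report itself. It shows why the log transformation is the λ = 0 member of the family, and why λ = 1 leaves the response unchanged apart from a shift.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```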
Creating a New Variable: Region
We first combined Latitude and Longitude to create a new variable, Region, a categorical variable that groups houses by location. The houses are separated into 4 different groups.
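The report does not state how the four groups were formed, so the following is only one plausible sketch: it assumes the houses are split into quadrants at the median latitude and median longitude. The data are synthetic, and the quadrant coding and the column name `RegionFactor` are assumptions. The model output later in the report shows a single Region coefficient, which suggests Region entered the models as a numeric code from 1 to 4; the sketch therefore keeps both a numeric code and a factor version.

```r
# Synthetic coordinates standing in for the sampled houses (assumption).
set.seed(1)
n <- 414
houses <- data.frame(
  Latitude  = runif(n, 24.93, 25.02),
  Longitude = runif(n, 121.47, 121.57)
)

# Assumed rule: split at the medians to obtain four quadrants.
north <- houses$Latitude  >= median(houses$Latitude)
east  <- houses$Longitude >= median(houses$Longitude)

# Numeric code: 1 = south-west, 2 = south-east, 3 = north-west, 4 = north-east.
houses$Region <- 1L + as.integer(east) + 2L * as.integer(north)

# Factor version, for use when Region should be modeled as categorical.
houses$RegionFactor <- factor(houses$Region, levels = 1:4,
                              labels = c("SW", "SE", "NW", "NE"))

table(houses$RegionFactor)  # number of houses in each group
```

Passing `RegionFactor` to `lm()` would estimate a separate effect for each group, whereas the numeric `Region` code forces a single slope across the four groups.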
Initial Model
##
## Call:
## lm(formula = PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT +
## NumConvenStores + Region, data = newdata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.077 -5.362 -0.843 3.612 73.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.897e+03 1.860e+03 -3.171 0.00164 **
## TransactionYear 2.954e+00 9.240e-01 3.197 0.00150 **
## HouseAge -3.039e-01 3.802e-02 -7.995 1.35e-14 ***
## Distance2MRT -3.319e-03 5.126e-04 -6.474 2.74e-10 ***
## NumConvenStores 1.276e+00 1.814e-01 7.031 8.68e-12 ***
## Region -3.324e+00 4.630e-01 -7.181 3.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.639 on 408 degrees of freedom
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5969
## F-statistic: 123.3 on 5 and 408 DF, p-value: < 2.2e-16
The line on the residuals plot is fairly horizontal, which suggests that a linear relationship is reasonable. However, the distribution of the response variable (PriceUnitArea) is skewed to the right, and several points on the normal Q-Q plot do not fall on the line. We therefore try a Box-Cox transformation to address the non-normality.
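A minimal sketch of how the initial fit and the diagnostics mentioned above could be produced is given below. The names `model` and `newdata1` follow the output shown in the report; the data themselves are synthetic, and `plot_diagnostics` is a hypothetical helper, so the numbers will not match the reported output.

```r
# Synthetic stand-in for newdata1 (assumption: Region is a numeric code 1-4).
set.seed(1)
n <- 414
newdata1 <- data.frame(
  TransactionYear = sample(c(2012, 2013), n, replace = TRUE),
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = sample(0:10, n, replace = TRUE),
  Region          = sample(1:4, n, replace = TRUE)
)
newdata1$PriceUnitArea <- exp(
  3.5 - 0.008 * newdata1$HouseAge - 0.00013 * newdata1$Distance2MRT +
    0.03 * newdata1$NumConvenStores + rnorm(n, sd = 0.25)
)

# Additive linear model with the same formula as in the report's output.
model <- lm(PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT +
              NumConvenStores + Region, data = newdata1)
print(summary(model))

# Hypothetical helper: the four standard lm diagnostic plots
# (including residuals vs fitted and the normal Q-Q plot)
# plus a histogram of the response to check for skew.
plot_diagnostics <- function(fit, y) {
  op <- par(mfrow = c(2, 3))
  on.exit(par(op))
  plot(fit)
  hist(y, main = "Distribution of PriceUnitArea", xlab = "PriceUnitArea")
}

# Draw the plots only in an interactive session.
if (interactive()) plot_diagnostics(model, newdata1$PriceUnitArea)
```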
Box-Cox Transformation
The drawback of the Box-Cox transformation is that it makes the analysis harder to understand, so it should be avoided when interpretation is the end goal of the analysis.
Neither 0 nor 1 falls within the 95% confidence interval for λ, so neither the log transformation (λ = 0) nor the untransformed response (λ = 1) is the exact optimum, and a power transformation with the optimal λ would be difficult to interpret. Because the optimal λ is close to 0.25, which is nearer to 0 than to 1, we perform the log transformation on the model.
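The Box-Cox step can be sketched with `boxcox()` from the MASS package, which profiles the log-likelihood over a grid of λ values. The data below are synthetic, so the estimate will differ from the λ ≈ 0.25 reported above; the grid spacing and the names `lambda_hat` and `ci` are illustrative choices. The 95% interval is taken as the set of λ values whose log-likelihood lies within half the 0.95 chi-squared quantile (1 degree of freedom) of the maximum.

```r
library(MASS)  # provides boxcox()

# Synthetic stand-in for newdata1 (assumption); the response must be positive.
set.seed(1)
n <- 414
newdata1 <- data.frame(
  TransactionYear = sample(c(2012, 2013), n, replace = TRUE),
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = sample(0:10, n, replace = TRUE),
  Region          = sample(1:4, n, replace = TRUE)
)
newdata1$PriceUnitArea <- exp(
  3.5 - 0.008 * newdata1$HouseAge - 0.00013 * newdata1$Distance2MRT +
    0.03 * newdata1$NumConvenStores + rnorm(n, sd = 0.25)
)

# y = TRUE stores the response in the fit so boxcox() can use it directly.
model <- lm(PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT +
              NumConvenStores + Region, data = newdata1, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot).
bc <- boxcox(model, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood on the grid.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% confidence interval for lambda.
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

cat("lambda_hat:", lambda_hat, " 95% CI:", ci[1], "to", ci[2], "\n")
```

If 1 lies outside the interval, the untransformed response is not supported; if 0 lies inside it, the log transformation is consistent with the data.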
Log-transformed Model
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -154.0639110 | 47.6720803 | -3.231743 | 0.0013302 |
| TransactionYear | 0.0784862 | 0.0236861 | 3.313599 | 0.0010033 |
| HouseAge | -0.0077758 | 0.0009746 | -7.978716 | 0.0000000 |
| Distance2MRT | -0.0001312 | 0.0000131 | -9.982173 | 0.0000000 |
| NumConvenStores | 0.0323972 | 0.0046514 | 6.965093 | 0.0000000 |
| Region | -0.0886254 | 0.0118679 | -7.467637 | 0.0000000 |
Interactions Model
We also ran a model using the interactions between the independent variables used in the previous models.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -9350.9548166 | 1.191741e+04 | -0.7846465 | 0.4331384 |
| TransactionYear | 4.6770133 | 5.921769e+00 | 0.7898001 | 0.4301257 |
| HouseAge | -489.5002643 | 5.526223e+02 | -0.8857773 | 0.3762848 |
| Distance2MRT | 11.8474949 | 7.106405e+00 | 1.6671572 | 0.0962880 |
| NumConvenStores | -634.5100481 | 2.326380e+03 | -0.2727457 | 0.7851934 |
| Region | -9663.6620059 | 8.370204e+03 | -1.1545312 | 0.2489912 |
| TransactionYear:HouseAge | 0.2429882 | 2.745875e-01 | 0.8849209 | 0.3767460 |
| TransactionYear:Distance2MRT | -0.0058925 | 3.530700e-03 | -1.6689236 | 0.0959371 |
| HouseAge:Distance2MRT | -0.6693656 | 3.248624e-01 | -2.0604590 | 0.0400192 |
| TransactionYear:NumConvenStores | 0.3157679 | 1.155913e+00 | 0.2731761 | 0.7848628 |
| HouseAge:NumConvenStores | 81.3914265 | 1.045307e+02 | 0.7786367 | 0.4366672 |
| Distance2MRT:NumConvenStores | 1.3880945 | 9.664889e-01 | 1.4362240 | 0.1517423 |
| TransactionYear:Region | 4.7979310 | 4.158845e+00 | 1.1536691 | 0.2493442 |
| HouseAge:Region | 722.5322840 | 3.899953e+02 | 1.8526690 | 0.0646870 |
| Distance2MRT:Region | 0.0024765 | 1.048700e-03 | 2.3615163 | 0.0186925 |
| NumConvenStores:Region | 1983.1580654 | 1.480535e+03 | 1.3394877 | 0.1811940 |
| TransactionYear:HouseAge:Distance2MRT | 0.0003328 | 1.614000e-04 | 2.0619781 | 0.0398735 |
| TransactionYear:HouseAge:NumConvenStores | -0.0404500 | 5.193740e-02 | -0.7788227 | 0.4365577 |
| TransactionYear:Distance2MRT:NumConvenStores | -0.0006917 | 4.802000e-04 | -1.4404044 | 0.1505574 |
| TransactionYear:HouseAge:Region | -0.3590501 | 1.937761e-01 | -1.8529123 | 0.0646521 |
| HouseAge:Distance2MRT:Region | -0.0000869 | 5.160000e-05 | -1.6852280 | 0.0927466 |
| TransactionYear:NumConvenStores:Region | -0.9852552 | 7.356254e-01 | -1.3393436 | 0.1812408 |
| HouseAge:NumConvenStores:Region | -126.6833895 | 6.875142e+01 | -1.8426294 | 0.0661439 |
| Distance2MRT:NumConvenStores:Region | 0.0008000 | 2.068000e-04 | 3.8684988 | 0.0001284 |
| TransactionYear:HouseAge:NumConvenStores:Region | 0.0629603 | 3.416000e-02 | 1.8430987 | 0.0660752 |
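As an illustration of how an interactions model of this kind can be specified, the sketch below crosses all five predictors with R's `*` formula operator and compares the result with the additive model using a nested F-test. This is an assumption-laden sketch: the table above omits some interaction terms, so the authors' exact term set (and whether the response was transformed) cannot be recovered from the report, and the data here are synthetic. The name `model.full` is hypothetical.

```r
# Synthetic stand-in for newdata1 (assumption).
set.seed(1)
n <- 414
newdata1 <- data.frame(
  TransactionYear = sample(c(2012, 2013), n, replace = TRUE),
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = sample(0:10, n, replace = TRUE),
  Region          = sample(1:4, n, replace = TRUE)
)
newdata1$PriceUnitArea <- exp(
  3.5 - 0.008 * newdata1$HouseAge - 0.00013 * newdata1$Distance2MRT +
    0.03 * newdata1$NumConvenStores + rnorm(n, sd = 0.25)
)

# Additive model (main effects only).
model <- lm(PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT +
              NumConvenStores + Region, data = newdata1)

# `*` expands to all main effects plus every interaction up to five-way.
model.full <- lm(PriceUnitArea ~ TransactionYear * HouseAge * Distance2MRT *
                   NumConvenStores * Region, data = newdata1)

# Nested F-test: do the interaction terms jointly improve the fit?
print(anova(model, model.full))

# Interaction terms with p-values below 0.05, if any.
ctab <- coef(summary(model.full))
print(ctab[ctab[, "Pr(>|t|)"] < 0.05, , drop = FALSE])
```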
Choosing the Optimal Model
| Initial model (model) | Log-transformed model (model.transform) | Interactions model (model.final) |
|---|---|---|
| 0.6017756 | 0.6854765 | 0.695547 |
The interactions model has the highest R^2, 69.55%. The log-transformed model has the second highest, 68.55%, and the initial model has the lowest, 60.18%. The interactions model is not as straightforward to interpret as the others because many of its interaction terms are statistically insignificant. We consider the log-transformed model the optimal model to report, since it has a high R^2 value, a simple structure, all significant p-values, and is easy to interpret.
The summary statistics for our chosen model are given in the following table.
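A comparison like the one above can be tabulated as in the sketch below, which also reports adjusted R^2, a measure that penalizes additional terms and is therefore a useful companion when one candidate has many more coefficients than the others. The data are synthetic, the helper `r2_row` is hypothetical, and fitting `model.final` on the untransformed response with a full crossing is an assumption. Note that the R^2 of a log-response model is measured on the log scale, so comparing it with models of the raw response is only a rough guide.

```r
# Synthetic stand-in for newdata1 (assumption).
set.seed(1)
n <- 414
newdata1 <- data.frame(
  TransactionYear = sample(c(2012, 2013), n, replace = TRUE),
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = sample(0:10, n, replace = TRUE),
  Region          = sample(1:4, n, replace = TRUE)
)
newdata1$PriceUnitArea <- exp(
  3.5 - 0.008 * newdata1$HouseAge - 0.00013 * newdata1$Distance2MRT +
    0.03 * newdata1$NumConvenStores + rnorm(n, sd = 0.25)
)

# Three candidate models, named as in the report's comparison table.
model <- lm(PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT +
              NumConvenStores + Region, data = newdata1)
model.transform <- lm(log(PriceUnitArea) ~ TransactionYear + HouseAge +
                        Distance2MRT + NumConvenStores + Region,
                      data = newdata1)
model.final <- lm(PriceUnitArea ~ TransactionYear * HouseAge * Distance2MRT *
                    NumConvenStores * Region, data = newdata1)

# Hypothetical helper: R^2 and adjusted R^2 for one fitted model.
r2_row <- function(fit) {
  s <- summary(fit)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}

models <- list(model = model, model.transform = model.transform,
               model.final = model.final)
comparison <- t(sapply(models, r2_row))  # one row per model
print(round(comparison, 4))
```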
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -154.0639110 | 47.6720803 | -3.231743 | 0.0013302 |
| TransactionYear | 0.0784862 | 0.0236861 | 3.313599 | 0.0010033 |
| HouseAge | -0.0077758 | 0.0009746 | -7.978716 | 0.0000000 |
| Distance2MRT | -0.0001312 | 0.0000131 | -9.982173 | 0.0000000 |
| NumConvenStores | 0.0323972 | 0.0046514 | 6.965093 | 0.0000000 |
| Region | -0.0886254 | 0.0118679 | -7.467637 | 0.0000000 |
In conclusion, all independent variables, including Region, are statistically significant (p-value < α = 0.05). TransactionYear and NumConvenStores have positive coefficients, so they are positively associated with house price, while all other variables have negative coefficients.
In this analysis, we aimed to discover how five independent variables (transaction year, house age, distance to the nearest MRT station, number of convenience stores, and region) affect the price of houses in Taipei, Taiwan. We found that all of the variables were significant predictors of Taipei home prices. Specifically, transaction year and number of convenience stores have positive relationships with house price, while house age, distance to MRT station, and region have negative relationships with house price. Using the log-transformed model as our final model reduced the skew in the response distribution and kept the interpretation simple. This analysis has given us insight into the different ways that house prices can be predicted. With more research, future analyses could include other variables such as home size, upgrades and updates, and interest rates. It would be beneficial to see how new factors may interact to determine the price of a home in Taipei.
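Because the chosen model has a log-transformed response, each coefficient can be read as an approximate percentage change in price per one-unit increase in the predictor, holding the others fixed. The sketch below converts the estimates reported in the table above using 100 × (exp(β) − 1). It assumes the natural logarithm was used (the default of R's `log()`); the names `coefs` and `pct_change` are illustrative.

```r
# Estimates copied from the log-transformed model table in the report.
coefs <- c(
  TransactionYear = 0.0784862,
  HouseAge        = -0.0077758,
  Distance2MRT    = -0.0001312,
  NumConvenStores = 0.0323972,
  Region          = -0.0886254
)

# Percentage change in PriceUnitArea per one-unit increase in each predictor,
# assuming the response was transformed with the natural log.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 2))
```

For example, under this assumption the TransactionYear estimate corresponds to prices about 8.2% higher in 2013 than in 2012, other variables held constant.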
This map shows the locations, by latitude and longitude, of all the sampled houses in Taipei. It was used to help create the new Region variable.
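A map of this kind could be built with the leaflet package roughly as follows. The coordinates are synthetic, the column names `Latitude` and `Longitude` are assumed, and the marker styling is an illustrative choice rather than the authors' settings. The code is guarded so that it does nothing if leaflet is not installed.

```r
# Synthetic coordinates standing in for the sampled houses (assumption).
set.seed(1)
houses <- data.frame(
  Latitude  = runif(414, 24.93, 25.02),
  Longitude = runif(414, 121.47, 121.57)
)

if (requireNamespace("leaflet", quietly = TRUE)) {
  # Base map tiles plus one small circle marker per house.
  house_map <- leaflet::addCircleMarkers(
    leaflet::addTiles(leaflet::leaflet(data = houses)),
    lng = ~Longitude, lat = ~Latitude,
    radius = 3, stroke = FALSE, fillOpacity = 0.6
  )
  # Display the interactive map only in an interactive session.
  if (interactive()) print(house_map)
}
```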