Table of Contents

Subtasks

Signatures

Abbreviations and Definitions

  1. Introduction

  2. Descriptive Statistics

  3. Methods

  4. Analysis

  5. Results

  6. Conclusion

  7. Mapping

Subtasks

Andrew: Final Model Building and Reporting, HTML Presentation Creation, SAP

Brian: Model/Residual Diagnostics, SAP

Julia: Descriptive Statistics Building and Writing, Report Conclusion, Analysis Plan Writing, SAP

Mikaela: Model testing, Report introduction, SAP

Signatures

Abbreviations and Definitions

1. Introduction

The data used in this report were collected to examine how house prices in Taipei, Taiwan, are affected by six factors related to time and location. The data set consists of 414 independent house observations and 8 variables, one of which is an ID variable. House price per unit area is the dependent variable (Y). The other six variables are independent variables that may affect house prices: the year of the transaction (X1), the age of the house (X2), the distance from the house to the nearest MRT station (X3), the number of convenience stores nearby (X4), and the house’s latitude (X5) and longitude (X6). Transaction year is treated as a categorical variable since it takes only two values: 2012 and 2013.

The following sections summarize the pre-analysis performed before modeling, the modeling methods that were explored, the reasoning behind choosing those methods, and the conclusions drawn from the models.

2. Descriptive Statistics

To examine our variables more closely, we began by computing descriptive statistics. First, we viewed the first six observations to see what the data look like at a glance and to decide which steps to take next. We then inspected the structure of the data, which gave the observation count (414), the variable count (8), and the variable types (all numerical). Next, we examined each variable’s mean, median, standard deviation, minimum, maximum, range, and standard error. These summaries describe the basic shape of the data and point to potential relationships worth exploring. We also used them to decide how to combine the latitude and longitude variables into a single variable, named Region.
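The steps described above can be sketched in base R. This is only an illustration: the data frame `houses` is simulated so that the block runs on its own, the helper `describe_var` is a hypothetical name, and the report's actual code and values are not reproduced here.

```r
# Simulated stand-in for the housing data (NOT the real observations).
set.seed(1)
n <- 414
houses <- data.frame(
  HouseAge        = runif(n, 0, 45),
  Distance2MRT    = rexp(n, rate = 1 / 1000),
  NumConvenStores = rpois(n, 4)
)
houses$PriceUnitArea <- exp(3.8 - 0.008 * houses$HouseAge -
                              0.00013 * houses$Distance2MRT +
                              0.03 * houses$NumConvenStores +
                              rnorm(n, sd = 0.25))

head(houses)  # first six observations
str(houses)   # observation count, variable count, variable types

# Mean, median, SD, min, max, range, and standard error of one variable.
describe_var <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x), range = max(x) - min(x),
    se = sd(x) / sqrt(length(x)))
}

# One row per variable, one column per statistic.
desc <- t(sapply(houses, describe_var))
print(round(desc, 3))
```

A mean that sits well above the median in such a table is an early hint of right skew in that variable.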

3. Methods

Multiple linear regression was used as a starting point for the analysis because it is simple to understand and interpret. Because the distribution of the response variable (PriceUnitArea) turned out to be skewed to the right, we explored other regression techniques. We first tried a Box-Cox transformation to normalize the distribution of PriceUnitArea, but the method was not very effective. The Box-Cox transformation is also known to make a model harder to interpret, which is a further reason we turned to a simpler special case of it. That special case is the log transformation, which corresponds to the Box-Cox transformation with λ = 0.
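For reference, the standard one-parameter Box-Cox family for a strictly positive response y can be written as follows. The report itself does not state the formula, so the notation here is an assumption based on the usual textbook definition; the limit on the right shows why the log transformation is the λ = 0 case.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad
\lim_{\lambda \to 0} \frac{y^{\lambda} - 1}{\lambda} = \log y .
```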

4. Analysis

Creating New Variable: Region

We first combined Latitude and Longitude to create a new variable, Region, a categorical variable that groups houses by location. The houses are separated into 4 groups.
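The report does not state the rule used to form the four groups. One possible rule, shown here purely as an assumption, is to split the coordinates into quadrants at the median latitude and median longitude. The coordinates below are simulated over an arbitrary range, and the labels NE, NW, SE, and SW are hypothetical.

```r
# Simulated coordinates (NOT the real house locations).
set.seed(2)
coords <- data.frame(
  Latitude  = runif(414, 24.93, 25.02),
  Longitude = runif(414, 121.47, 121.57)
)

# Assumed rule: quadrant split at the medians of latitude and longitude.
north <- coords$Latitude  >= median(coords$Latitude)
east  <- coords$Longitude >= median(coords$Longitude)

coords$Region <- factor(
  ifelse(north & east, "NE",
         ifelse(north, "NW",
                ifelse(east, "SE", "SW"))),
  levels = c("NE", "NW", "SE", "SW")
)

print(table(coords$Region))  # number of houses per group
```

The model summaries in this report show a single Region coefficient, which is what one would expect if Region were entered as a numeric code. Storing it as a factor, as in this sketch, would instead produce one coefficient per non-reference group.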

Initial Model

## 
## Call:
## lm(formula = PriceUnitArea ~ TransactionYear + HouseAge + Distance2MRT + 
##     NumConvenStores + Region, data = newdata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.077  -5.362  -0.843   3.612  73.870 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.897e+03  1.860e+03  -3.171  0.00164 ** 
## TransactionYear  2.954e+00  9.240e-01   3.197  0.00150 ** 
## HouseAge        -3.039e-01  3.802e-02  -7.995 1.35e-14 ***
## Distance2MRT    -3.319e-03  5.126e-04  -6.474 2.74e-10 ***
## NumConvenStores  1.276e+00  1.814e-01   7.031 8.68e-12 ***
## Region          -3.324e+00  4.630e-01  -7.181 3.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.639 on 408 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5969 
## F-statistic: 123.3 on 5 and 408 DF,  p-value: < 2.2e-16

The line on the residuals plot is fairly horizontal, which suggests that a linear model is reasonable. However, the distribution of the response variable (PriceUnitArea) is skewed to the right, and a few points on the Normal Q-Q plot do not fall on the line. We therefore tried a Box-Cox transformation to address the non-normality.

Box-Cox Transformation

The issue with the Box-Cox transformation is that it makes the analysis harder to understand, so it should be avoided when interpretation is the end goal of the analysis.

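Since neither 0 nor 1 lies within the 95% confidence interval for λ, neither the untransformed response (λ = 1) nor the log transformation (λ = 0) is the statistically optimal choice. The optimal λ is close to 0.25, but a power transformation with that value would be hard to interpret. Because 0.25 is closer to 0 than to 1, we apply the log transformation to the response instead.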
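A minimal sketch of how the optimal λ and its approximate 95% confidence interval can be read off with `MASS::boxcox` is given below. The data are simulated and the variable names (`x`, `y`, `fit`, `lambda_hat`, `ci`) are placeholders, so the numbers it prints are not the report's results.

```r
library(MASS)  # provides boxcox()

# Simulated right-skewed, strictly positive response (NOT the real data).
set.seed(3)
n <- 414
x <- runif(n, 0, 45)
y <- exp(3.8 - 0.008 * x + rnorm(n, sd = 0.3))
fit <- lm(y ~ x)

# Profile log-likelihood of lambda over a fine grid, without plotting.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# The optimal lambda maximizes the profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% CI: grid values whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y >= cutoff])

cat("lambda_hat =", lambda_hat, " 95% CI = [", ci[1], ",", ci[2], "]\n")
```

Checking whether 0 or 1 falls inside `ci` is the comparison described in the paragraph above.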

Log-transformed Model

Summarized statistics of the regression coefficients of the model with the log-transformed response
                     Estimate  Std. Error    t value   Pr(>|t|)
(Intercept)      -154.0639110  47.6720803  -3.231743  0.0013302
TransactionYear     0.0784862   0.0236861   3.313599  0.0010033
HouseAge           -0.0077758   0.0009746  -7.978716  0.0000000
Distance2MRT       -0.0001312   0.0000131  -9.982173  0.0000000
NumConvenStores     0.0323972   0.0046514   6.965093  0.0000000
Region             -0.0886254   0.0118679  -7.467637  0.0000000

Interactions Model

We also ran a model using the interactions between the independent variables used in the previous models.

Summarized statistics of the regression coefficients of the final model
                                                      Estimate    Std. Error     t value   Pr(>|t|)
(Intercept)                                      -9350.9548166  1.191741e+04  -0.7846465  0.4331384
TransactionYear                                      4.6770133  5.921769e+00   0.7898001  0.4301257
HouseAge                                          -489.5002643  5.526223e+02  -0.8857773  0.3762848
Distance2MRT                                        11.8474949  7.106405e+00   1.6671572  0.0962880
NumConvenStores                                   -634.5100481  2.326380e+03  -0.2727457  0.7851934
Region                                           -9663.6620059  8.370204e+03  -1.1545312  0.2489912
TransactionYear:HouseAge                             0.2429882  2.745875e-01   0.8849209  0.3767460
TransactionYear:Distance2MRT                        -0.0058925  3.530700e-03  -1.6689236  0.0959371
HouseAge:Distance2MRT                               -0.6693656  3.248624e-01  -2.0604590  0.0400192
TransactionYear:NumConvenStores                      0.3157679  1.155913e+00   0.2731761  0.7848628
HouseAge:NumConvenStores                            81.3914265  1.045307e+02   0.7786367  0.4366672
Distance2MRT:NumConvenStores                         1.3880945  9.664889e-01   1.4362240  0.1517423
TransactionYear:Region                               4.7979310  4.158845e+00   1.1536691  0.2493442
HouseAge:Region                                    722.5322840  3.899953e+02   1.8526690  0.0646870
Distance2MRT:Region                                  0.0024765  1.048700e-03   2.3615163  0.0186925
NumConvenStores:Region                            1983.1580654  1.480535e+03   1.3394877  0.1811940
TransactionYear:HouseAge:Distance2MRT                0.0003328  1.614000e-04   2.0619781  0.0398735
TransactionYear:HouseAge:NumConvenStores            -0.0404500  5.193740e-02  -0.7788227  0.4365577
TransactionYear:Distance2MRT:NumConvenStores        -0.0006917  4.802000e-04  -1.4404044  0.1505574
TransactionYear:HouseAge:Region                     -0.3590501  1.937761e-01  -1.8529123  0.0646521
HouseAge:Distance2MRT:Region                        -0.0000869  5.160000e-05  -1.6852280  0.0927466
TransactionYear:NumConvenStores:Region              -0.9852552  7.356254e-01  -1.3393436  0.1812408
HouseAge:NumConvenStores:Region                   -126.6833895  6.875142e+01  -1.8426294  0.0661439
Distance2MRT:NumConvenStores:Region                  0.0008000  2.068000e-04   3.8684988  0.0001284
TransactionYear:HouseAge:NumConvenStores:Region      0.0629603  3.416000e-02   1.8430987  0.0660752

Choosing the Optimal Model

Coefficients of determination (R^2) of the three candidate models
model      model.transform  model.final
0.6017756  0.6854765        0.695547

The interaction model has the highest R^2 at 69.55%, followed by the log-transformed model at 68.55% and the initial model at 60.18%. The interaction model is not as straightforward to interpret as the others because most of its interaction terms are not significant. We consider the log-transformed model the optimal model to report since it has a high R^2 value, a simple structure, all significant p-values, and is easy to interpret.
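A comparison of this kind can be sketched as follows. The data are simulated, only two predictors are used, and the sketch assumes the interaction model is fitted on the log scale, which the report does not state. The object names mirror the column labels of the table above, but the fitted objects are stand-ins.

```r
# Simulated data (NOT the real observations); 'age' and 'dist' are placeholders.
set.seed(4)
n <- 414
d <- data.frame(age = runif(n, 0, 45), dist = rexp(n, rate = 1 / 1000))
d$price <- exp(3.8 - 0.008 * d$age - 0.00013 * d$dist + rnorm(n, sd = 0.25))

model           <- lm(price ~ age + dist, data = d)       # untransformed response
model.transform <- lm(log(price) ~ age + dist, data = d)  # log-transformed response
model.final     <- lm(log(price) ~ age * dist, data = d)  # adds the interaction (assumed log scale)

fits <- list(model = model,
             model.transform = model.transform,
             model.final = model.final)

# R^2 and adjusted R^2 for each candidate, one column per model.
r2 <- sapply(fits, function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
})
print(round(r2, 4))
```

Because R^2 cannot decrease when terms are added to a model with the same response, the adjusted R^2 row is the more informative one when weighing the interaction model against the simpler log-transformed model.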

5. Results

The summary statistics for our chosen model are given in the following table.

Summary of the final working model
                     Estimate  Std. Error    t value   Pr(>|t|)
(Intercept)      -154.0639110  47.6720803  -3.231743  0.0013302
TransactionYear     0.0784862   0.0236861   3.313599  0.0010033
HouseAge           -0.0077758   0.0009746  -7.978716  0.0000000
Distance2MRT       -0.0001312   0.0000131  -9.982173  0.0000000
NumConvenStores     0.0323972   0.0046514   6.965093  0.0000000
Region             -0.0886254   0.0118679  -7.467637  0.0000000

All independent variables, including Region, are statistically significant (p-value < α = 0.05). TransactionYear and NumConvenStores have positive coefficients, so they are positively associated with the log of price per unit area, while HouseAge, Distance2MRT, and Region have negative coefficients and are negatively associated with it.
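Since the response is log-transformed, each coefficient can be read as an approximate percent change in price per unit area for a one-unit increase in the predictor, holding the others fixed. The sketch below applies the usual conversion 100 × (exp(b) − 1) to the estimates from the table above. It assumes the natural log was used, and the names `coefs` and `pct_change` are placeholders.

```r
# Estimates copied from the final working model table (log-transformed response).
coefs <- c(TransactionYear = 0.0784862,
           HouseAge        = -0.0077758,
           Distance2MRT    = -0.0001312,
           NumConvenStores = 0.0323972,
           Region          = -0.0886254)

# Percent change in price per unit area for a one-unit increase in each
# predictor, assuming the response was transformed with the natural log.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 2))
```

Under that assumption, for example, each additional year of house age corresponds to a price per unit area that is a little under 1% lower.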

6. Conclusion

In this analysis, we aimed to discover how five independent variables (transaction year, house age, distance to MRT station, number of convenience stores, and region) affect the price of houses in Taipei, Taiwan. All of the variables were significant predictors of Taipei home prices. Specifically, transaction year and number of convenience stores have positive relationships with house price, while house age, distance to MRT station, and region have negative relationships with house price. Using the log-transformed model as our final model reduced the skew in the response and kept the interpretation simple. This analysis gave us insight into how house prices can be predicted. Future analyses could examine other variables such as home size, upgrades and updates, and interest rates. It would be beneficial to see how such factors interact to determine the price of a home in Taipei.

7. Mapping

This map shows the locations, using latitude and longitude, of all the houses in Taipei that were sampled. This map helped to create the new Region variable.