class: center, middle, inverse, title-slide

.title[
# Taipei Real Estate House Prices
]
.subtitle[
## EDA and Linear Regression Analysis
]
.author[
### Andrew Heneghan<br>Julia Randazzo<br>Boning Liu<br>Mikaela Taylor
]
.institute[
### West Chester University of Pennsylvania
]
.date[
### 12/15/2022<br>STA 490: Capstone Statistics
]

---
class: middle, center

# <font color = "red">Agenda</font>

<font size = 5>Analysis Subtasks and Research Question</font><br>
<font size = 5>Descriptive Statistics</font><br>
<font size = 5>Region Variable and Mapping</font><br>
<font size = 5>Model Diagnostics</font><br>
<font size = 5>Box-Cox Transformation</font><br>
<font size = 5>New Candidate Models (Log-Transformed/Interactions)</font><br>
<font size = 5>Choosing an Optimal Model</font><br>
<font size = 5>Conclusion</font><br>

---
class: middle, center

## <center>Subtasks</center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/BrianL0201/BrianL0201.github.io/main/Subtasks.png" width="100%" style="display: block; margin: auto;" />

---
class: middle, center

## Research Question

### How do certain aspects of houses in Taipei affect the house price of unit area?

<img src = "https://azheneghan.github.io/aheneghan/images/House_Prices.png" width="250" height="220">

---
class: top

## <center>Description of Data</center>

.pull-left[

- The data on houses in Taipei, Taiwan, were collected by real estate investors.
- 414 independent houses in or near Taipei were randomly sampled.
- The data set consists of 8 variables, including an ID variable.
- House price of unit area is the dependent variable.
- Transaction year is treated as categorical because it takes only two values, 2012 and 2013.

]

.pull-right[

- The independent variables are:
    + year of the transaction date (X1)
    + house age (X2)
    + distance from the house to the nearest MRT station (X3)
    + number of nearby convenience stores (X4)
    + latitude (X5)
    + longitude (X6)

<img src = "https://azheneghan.github.io/aheneghan/images/taipei.png" width="600" height="250">

]

---
class: top

## <center>Descriptive Statistics</center>

- We viewed the first six observations of the data.
- We then looked at the data structure, which reported the observation count (414), the variable count (8), and the variable types (all numerical except TransactionYear).
- We then examined the variables’ means, medians, standard deviations, minimum and maximum values, ranges, and standard errors.
- We also made a histogram of the distribution of the response variable, house price of unit area.

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: top

## <center>New Region Variable and Mapping</center>

We combined the Longitude and Latitude variables into one variable, called Region. This map shows the locations, by latitude and longitude, of all the sampled houses in Taipei. The map helped us create the new Region variable.

<center>
</center>

---
class: top

## <center>Diagnostics</center>

.pull-left[

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

]

.pull-right[

- All independent variables are linearly correlated with house price of unit area.
- No pair of independent variables is linearly correlated, so collinearity is unlikely.
- There is one outlier (Obs. 271) in the data. This house has a house price of unit area of 117.5.

]

---
class: top

## <center>Diagnostics (cont'd)</center>

.pull-left[

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

]

.pull-right[

- <font size = 4 color = "black">The lines on the Residuals vs Fitted and Scale-Location graphs are approximately horizontal, and the points on the Scale-Location graph are evenly spread. This supports a linear relationship and homogeneity of variance.</font><br>
- <font size = 4 color = "black">Some of the points on the Normal QQ plot fall off the line, which suggests that the residuals are not normally distributed.</font>

]

---
class: middle

## <center>Box-Cox Transformation</center>

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

- Using the exact power transformation would make the analysis harder to interpret.
- Neither 0 nor 1 lies within the 95% confidence interval for λ. Excluding 1 indicates that the response should be transformed, while excluding 0 means a log transformation is only an approximation to the optimal power.
- Since the optimal λ is close to 0.25, which is nearer to 0 than to 1, we apply a log transformation to the response.
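
---
class: top

## <center>Box-Cox Transformation (Sketch)</center>

The confidence-interval reasoning for λ can be illustrated with a small, self-contained sketch. This is not the code behind the plot in this deck: it is a Python illustration under simplifying assumptions, using an intercept-only model and a made-up `prices` list (hypothetical values, not the Taipei data). The function names `box_cox` and `profile_loglik` are also illustrative. It evaluates the Box-Cox profile log-likelihood on a grid of λ values, picks the maximizer, and keeps every λ whose log-likelihood lies within half the 95% chi-square(1) quantile of the maximum as an approximate 95% confidence interval.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform; lam == 0 corresponds to the log transform."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only model.

    Additive constants that do not depend on lam are dropped.
    """
    z = box_cox(y, lam)
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n  # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(var) + jacobian


# Hypothetical right-skewed prices (assumption: NOT the Taipei data).
prices = [12.0, 18.5, 22.1, 25.3, 27.0, 30.6, 34.2,
          38.4, 42.0, 47.7, 55.1, 63.2, 78.0, 96.4]

# Grid of candidate lambdas from -2 to 2 in steps of 0.05.
grid = [i / 20 for i in range(-40, 41)]
ll = {lam: profile_loglik(prices, lam) for lam in grid}

# Grid maximizer of the profile log-likelihood.
best = max(grid, key=lambda lam: ll[lam])

# Approximate 95% interval: lambdas within chi-square(1) 0.95 quantile / 2
# of the maximum log-likelihood.
cutoff = ll[best] - 0.5 * 3.841458820694124
ci = [lam for lam in grid if ll[lam] >= cutoff]

print("best lambda:", best)
print("approx. 95% CI:", (min(ci), max(ci)))
print("CI contains 0 (log):", 0.0 in ci)
print("CI contains 1 (no transform):", 1.0 in ci)
```

Checking whether 0 and 1 fall inside the interval mirrors the decision on the previous slide: λ = 1 corresponds to leaving the response untransformed, and λ = 0 corresponds to the log transformation.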
</font>

---
class: top

## <center>Log-Transformed Model</center>

.pull-left[

- More points now fall on the line in the Normal QQ plot than for the initial model.<br>
- The line on the Scale-Location plot is slightly more slanted than for the initial model, but the points are still evenly spread.

]

.pull-right[

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

]

---
class: middle, center

## <center>Interactions Model</center>

<font size = 3 color = "black">Since the interaction effects are insignificant, we used an automatic variable selection method to find the final model.<br>The resulting residual plots were similar to those of the initial model.</font>

<img src="data:image/png;base64,#Taipei-Realestate-Linear-Regression-HTML-Presentation_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
class: middle, center

## <center>Interactions Model Summary</center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/BrianL0201/BrianL0201.github.io/main/Summarized%20Statistics.png" width="100%" style="display: block; margin: auto;" />

---
class: middle, center

## <center>Choosing an Optimal Model</center>

Table: Coefficients of correlation of the three candidate models

|    model| model.transform| model.final|
|--------:|---------------:|-----------:|
| 0.604729|       0.6890686|   0.7051469|

<br><font size = 4 color = "black">The interactions model has the highest R^2 (69.56%), followed by the log-transformed model (68.55%) and the initial model (60.18%). The interactions model (model.final) is not as straightforward to interpret as the others because of its insignificant interactions. We believe the log-transformed model is the optimal model to report since it has a high R^2 value, a simple structure, and significant p-values for all terms except Region2, and it is easy to interpret.
</font>

---
class: middle, center

## <center>Results</center>

Table: Summary of the final working model

|                |     Estimate| Std. Error|   t value| Pr(>&#124;t&#124;)|
|:---------------|------------:|----------:|---------:|------------------:|
|(Intercept)     | -157.1661789| 47.6825588| -3.296094|          0.0010666|
|TransactionYear |    0.0799961|  0.0236917|  3.376540|          0.0008048|
|HouseAge        |   -0.0076916|  0.0009731| -7.904176|          0.0000000|
|Distance2MRT    |   -0.0001552|  0.0000173| -8.947899|          0.0000000|
|NumConvenStores |    0.0303354|  0.0047710|  6.358351|          0.0000000|
|Region2         |   -0.0864295|  0.0545576| -1.584188|          0.1139294|
|Region3         |   -0.2044123|  0.0268185| -7.622074|          0.0000000|
|Region4         |   -0.1883889|  0.0513857| -3.666172|          0.0002789|

<font size = 4 color = "black">All independent variables except Region2 are statistically significant at α = 0.05 (p-value < 0.05). TransactionYear and NumConvenStores have positive coefficients, while all other variables have negative coefficients.</font>

---
class: center

## <center>Conclusion</center>

- We aimed to see how transaction year, house age, distance to MRT station, number of convenience stores, and region affect the price of houses in Taipei, Taiwan.
- All variables except Region2 were significant predictors of Taipei home prices.
- Transaction year and number of convenience stores have positive relationships with house price, while house age, distance to MRT station, and the region indicators have negative relationships with house price.
- The log-transformed model worked best as the final model for a linear regression analysis of this data.
- It would be beneficial to see how additional factors, such as home size, upgrades and updates, and interest rates, may interact to determine the price of a home in Taipei.

---
class: middle, center

## <center>References</center>

Sibanda, Hastings. “Taipei Housing Dataset UCI.” Kaggle, 30 Jan. 2022, https://www.kaggle.com/datasets/hastingssibanda/taipei-housing-dataset-uci.

---
class: middle, center

<img src="data:image/png;base64,#https://raw.githubusercontent.com/BrianL0201/BrianL0201.github.io/main/Team%20Signatures.png" width="75%" style="display: block; margin: auto;" />