Housing search results

library(dplyr) 
library(rvest)   
library(stringr)
library(ggplot2)

I was not able to do this in R. The code I wrote turned out to be quite erratic: it would return either the right values or empty strings, seemingly at random. Eventually, I decided to scrape the data using Python by following this guide (code here). After some more debugging, I was able to parse the webpages into multiple CSV files, one per results page (iteration didn’t work here either).

I’m using houses for sale in Denver, CO with 2+ bedrooms and 1+ bathrooms. Since I iterated through the pages manually, I’m only using results from the first four pages.

# dummy R code 
someLink<- read_html("link") 
price <- someLink %>% html_nodes('.list-card-price') %>% html_text()
details <- someLink %>% html_nodes('.list-card-details') %>% html_text() 
# repeat manually for multiple pages, or write a loop 
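
The loop mentioned in the last comment might look like the sketch below. It has not been run against the live site: the helper name `parse_page` and the `page_urls` vector are hypothetical, and the only selectors assumed are the two used above. A tiny inline page stands in for a real results page so the parsing step can be tried on its own.

```r
library(rvest)

# Parse one already-loaded page into a data frame.
# Assumes the page uses the same two CSS classes as the dummy code above
# and that every listing has both a price and a details node
# (data.frame() errors if the two vectors differ in length).
parse_page <- function(page) {
  price   <- page %>% html_nodes(".list-card-price") %>% html_text()
  details <- page %>% html_nodes(".list-card-details") %>% html_text()
  data.frame(price = price, details = details, stringsAsFactors = FALSE)
}

# Inline HTML standing in for a real search-results page
demo_page <- read_html('<html><body>
  <div class="list-card-price">$450,000</div>
  <div class="list-card-details">2 bds, 2.0 ba ,1247 sqft</div>
  <div class="list-card-price">$400,000</div>
  <div class="list-card-details">2 bds, 1.0 ba ,728 sqft</div>
</body></html>')
parse_page(demo_page)

# For real pages (placeholder links, not the actual address pattern):
# page_urls <- c("link1", "link2", "link3", "link4")
# listings  <- do.call(rbind, lapply(page_urls, function(u) parse_page(read_html(u))))
```

Keeping the parsing in one function means an erratic page only has to be debugged in one place.
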
# import csv files
temp1 <- read.csv("prop1.csv")
temp2 <- read.csv("prop2.csv")
temp3 <- read.csv("prop3.csv")
temp4 <- read.csv("prop4.csv")

# merge into a dataset
houses <- rbind(temp1, temp2, temp3, temp4)
head(houses, 3)
##            title address   city state postal_code    price
## 1 House for sale      NA Denver    CO       80204 $450,000
## 2 House for sale      NA Denver    CO       80219 $400,000
## 3 House for sale      NA Denver    CO       80210 $729,900
##         facts.and.features        real.estate.provider
## 1 2 bds, 2.0 ba ,1247 sqft        RE/MAX Professionals
## 2  2 bds, 1.0 ba ,728 sqft Equity Colorado Real Estate
## 3 3 bds, 2.0 ba ,1922 sqft   Key Real Estate Group LLC
##                                                                                    url
## 1        https://www.zillow.com/homedetails/634-King-St-Denver-CO-80204/13340690_zpid/
## 2  https://www.zillow.com/homedetails/2315-S-Hooker-Way-Denver-CO-80219/13380314_zpid/
## 3 https://www.zillow.com/homedetails/2266-S-Humboldt-St-Denver-CO-80210/13376243_zpid/
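
The four `read.csv()` calls above can also be written as a loop over file names, which scales better if more pages are scraped later. A minimal sketch, assuming the files follow the `prop<N>.csv` naming used above; the helper name `read_props` and the throwaway demo files are made up for illustration.

```r
# Read every prop<N>.csv in a directory and stack them into one data frame.
# Assumes all files share the same columns (rbind() requires this).
# stringsAsFactors = FALSE keeps text columns as character rather than factor.
read_props <- function(dir = ".") {
  files <- list.files(dir, pattern = "^prop[0-9]+\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

# Demo with two small temporary files so the sketch runs on its own
demo_dir <- file.path(tempdir(), "props_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(city = "Denver", price = "$450,000"),
          file.path(demo_dir, "prop1.csv"), row.names = FALSE)
write.csv(data.frame(city = "Denver", price = "$400,000"),
          file.path(demo_dir, "prop2.csv"), row.names = FALSE)

houses_demo <- read_props(demo_dir)
nrow(houses_demo)
```

With the real files in the working directory, `read_props()` with no arguments should give the same rows as the manual `rbind()`.
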

Some cleaning

colnames(houses)
## [1] "title"                "address"              "city"                
## [4] "state"                "postal_code"          "price"               
## [7] "facts.and.features"   "real.estate.provider" "url"
# see what form the information is in 
head(houses$facts.and.features)
## [1] 2 bds, 2.0 ba ,1247 sqft 2 bds, 1.0 ba ,728 sqft  3 bds, 2.0 ba ,1922 sqft
## [4] 4 bds, 3.0 ba ,2326 sqft 3 bds, 2.0 ba ,1211 sqft 4 bds, 4.0 ba ,3400 sqft
## 159 Levels: 2 bds, 1.0 ba ,1065 sqft ... 6 bds, 4.0 ba ,4187 sqft
# clean up price, then split each facts.and.features string on spaces
# and pick out the numeric pieces by position
houses <- houses %>% mutate(facts.and.features = as.character(facts.and.features),
                            price = as.numeric(str_replace_all(price, "[^0-9]", "")))
facts <- strsplit(houses$facts.and.features, " ")                 # split once, not once per row
houses$beds  <- as.numeric(sapply(facts, `[`, 1))                 # 1st piece, e.g. "2"
houses$baths <- as.numeric(sapply(facts, `[`, 3))                 # 3rd piece, e.g. "2.0"
houses$sqFt  <- as.numeric(sub("^,", "", sapply(facts, `[`, 5)))  # 5th piece, e.g. ",1247"
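
Splitting on spaces and picking pieces by position depends on the exact spacing of the string. An alternative is to match the whole pattern with one regular expression. This is only a sketch: the helper name `parse_facts` is mine, and the pattern assumes every entry looks like the `"2 bds, 2.0 ba ,1247 sqft"` examples printed above.

```r
library(stringr)

# Pull beds, baths and square footage out of strings such as
# "2 bds, 2.0 ba ,1247 sqft" with a single regular expression.
# Entries that do not match the pattern come back as NA.
parse_facts <- function(x) {
  m <- str_match(as.character(x),
                 "(\\d+)\\s*bds?\\s*,\\s*([0-9.]+)\\s*ba\\s*,\\s*([0-9,]+)\\s*sqft")
  data.frame(beds  = as.numeric(m[, 2]),
             baths = as.numeric(m[, 3]),
             sqFt  = as.numeric(gsub(",", "", m[, 4])))  # drop any thousands separators
}

parse_facts(c("2 bds, 2.0 ba ,1247 sqft", "3 bds, 2.0 ba ,1922 sqft", "Studio"))
```

The result could be attached to the data with something like `cbind(houses, parse_facts(houses$facts.and.features))`.
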

Visualization

# not sure about informativeness but here is a plot of house prices across postal codes 

houses %>% select(beds, baths, sqFt, price, postal_code) %>%
  ggplot(aes(factor(postal_code), price, color = factor(beds))) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "postal code") + scale_color_discrete(name = "Number of bedrooms")

Regression Analysis

summary(lm(price ~ beds + baths + sqFt, houses))
## 
## Call:
## lm(formula = price ~ beds + baths + sqFt, data = houses)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1059639  -173336    -2943   133256  1726320 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  257996.58   90481.42   2.851  0.00494 ** 
## beds         -74372.00   32750.28  -2.271  0.02452 *  
## baths       -106517.53   40902.19  -2.604  0.01010 *  
## sqFt            485.79      42.51  11.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 319600 on 156 degrees of freedom
## Multiple R-squared:  0.6429, Adjusted R-squared:  0.636 
## F-statistic: 93.61 on 3 and 156 DF,  p-value: < 2.2e-16
summary(lm(price ~ beds + baths + sqFt + factor(postal_code), houses))
## 
## Call:
## lm(formula = price ~ beds + baths + sqFt + factor(postal_code), 
##     data = houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -660476 -112371   -6859   94183 1610265 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -78026.11  213100.74  -0.366   0.7148    
## beds                      -79171.70   31340.60  -2.526   0.0127 *  
## baths                       6313.64   40550.07   0.156   0.8765    
## sqFt                         397.79      40.52   9.817   <2e-16 ***
## factor(postal_code)80203  769494.05  344107.66   2.236   0.0270 *  
## factor(postal_code)80204  124229.00  222965.37   0.557   0.5784    
## factor(postal_code)80205  310629.15  221289.39   1.404   0.1627    
## factor(postal_code)80206  510379.28  212139.52   2.406   0.0175 *  
## factor(postal_code)80207  287392.61  226404.81   1.269   0.2065    
## factor(postal_code)80209  468572.14  211071.49   2.220   0.0281 *  
## factor(postal_code)80210  419897.68  201904.61   2.080   0.0395 *  
## factor(postal_code)80211  429761.56  210195.19   2.045   0.0429 *  
## factor(postal_code)80212  337384.38  213430.81   1.581   0.1163    
## factor(postal_code)80218  542654.15  270245.36   2.008   0.0467 *  
## factor(postal_code)80219  183689.42  204317.00   0.899   0.3703    
## factor(postal_code)80220  359757.63  205300.53   1.752   0.0820 .  
## factor(postal_code)80221  280384.86  248874.92   1.127   0.2620    
## factor(postal_code)80222  190246.96  226030.78   0.842   0.4015    
## factor(postal_code)80223  343985.33  249096.67   1.381   0.1696    
## factor(postal_code)80230   77290.80  333609.55   0.232   0.8171    
## factor(postal_code)80231 -192699.81  222696.17  -0.865   0.3884    
## factor(postal_code)80236  206581.55  246598.48   0.838   0.4037    
## factor(postal_code)80237   -2731.21  216859.89  -0.013   0.9900    
## factor(postal_code)80238   18107.75  216759.36   0.084   0.9335    
## factor(postal_code)80239   58199.60  226402.32   0.257   0.7975    
## factor(postal_code)80246  439988.90  272515.07   1.615   0.1088    
## factor(postal_code)80247 -149401.80  330540.88  -0.452   0.6520    
## factor(postal_code)80249 -146036.21  210624.39  -0.693   0.4893    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269500 on 132 degrees of freedom
## Multiple R-squared:  0.7853, Adjusted R-squared:  0.7413 
## F-statistic: 17.88 on 27 and 132 DF,  p-value: < 2.2e-16

People in Denver do not enjoy having extra bedrooms and bathrooms, it seems. Holding the other variables fixed, one more bedroom was associated with a ~74K decrease in price and one more bathroom with a ~107K decrease! Since square footage is held constant, this may simply mean that carving the same floor area into more rooms does not add value. They do, however, seem to value extra space: the coefficient on sqFt is positive and statistically significant (p < 0.001), at about $486 per additional square foot. Just don’t spend that space on more rooms! Including postal-code fixed effects shows bathrooms may have some value after all (coeff est. = 6313.64, not statistically significant at usual levels), but people still don’t care for bedrooms (coeff est. = -79171.70, p = 0.0127). Adding postal_code dummies reduces standard errors (somewhat) and improves R-sq (64.3% to 78.5%; adjusted R-sq 63.6% to 74.1%).
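
Whether the postal-code dummies jointly improve the fit can be checked with a nested-model F test. The sketch below rebuilds that test from the residual standard errors and degrees of freedom printed in the two summaries, so it is only approximate (the printed values are rounded); the variable names are mine.

```r
# Numbers copied from the two summary() printouts
rse_small <- 319600; df_small <- 156   # price ~ beds + baths + sqFt
rse_large <- 269500; df_large <- 132   # same model + factor(postal_code)

# Residual sum of squares = (residual standard error)^2 * residual df
rss_small <- rse_small^2 * df_small
rss_large <- rse_large^2 * df_large

# F statistic for the extra postal-code dummies in the larger model
q      <- df_small - df_large          # number of added coefficients (24)
f_stat <- ((rss_small - rss_large) / q) / (rss_large / df_large)
p_val  <- pf(f_stat, q, df_large, lower.tail = FALSE)

round(f_stat, 2)   # roughly 3.6
p_val
```

If the two fits are stored as objects, calling `anova()` on the pair gives the same test without the hand arithmetic.
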

Improvements

Some other house characteristics might be of interest, e.g. year built or major renovations; Zillow has information on year of construction too. Other factors might be access to important sites (recreational, business-related), and perhaps crime rates or high-school graduation rates, although all of these are probably captured by the postal code.

The use of GIS datasets, especially ones spanning multiple decades, might be informative as well.