library(dplyr)
library(rvest)
library(stringr)
library(ggplot2)
I was not able to do this in R. The code I wrote behaved erratically, returning either the right values or empty strings, seemingly at random. Eventually, I decided to scrape the data with Python instead, following this guide (code here). After some more debugging, I was able to parse the webpages into multiple CSV files (iterating over the pages didn't work in Python either).
I’m using houses for sale in Denver, CO with 2+ bedrooms and 1+ bathrooms. Since I iterated through the pages manually, I’m only using results from the first four pages.
# dummy R code
someLink <- read_html("link")
price <- someLink %>% html_nodes('.list-card-price') %>% html_text()
details <- someLink %>% html_nodes('.list-card-details') %>% html_text()
# repeat manually for multiple pages, or write a loop
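One possible cause of the "right values or empty strings" behaviour is that extracting each field from the whole page gives vectors of different lengths whenever a listing lacks a field. The sketch below is one way to write the loop, not the code I actually ran: it extracts fields card by card so that a missing field becomes NA. The `.list-card-info` container selector and the `page_urls` vector are assumptions for illustration; only `.list-card-price` and `.list-card-details` come from the dummy code above.

```r
library(rvest)

# Parse one results page into a data frame, one row per listing card.
# `card_css` is an ASSUMED selector for the element wrapping each listing.
scrape_page <- function(page, card_css = ".list-card-info") {
  cards <- html_nodes(page, card_css)
  data.frame(
    # html_node() returns one entry per card, so a missing field becomes NA
    price   = html_text(html_node(cards, ".list-card-price")),
    details = html_text(html_node(cards, ".list-card-details")),
    stringsAsFactors = FALSE
  )
}

# Hypothetical usage: `page_urls` would hold the URL of each results page.
# pages  <- lapply(page_urls, read_html)
# houses <- do.call(rbind, lapply(pages, scrape_page))
```

Because every page yields a data frame with the same two columns, the results can be stacked with `rbind` regardless of how many listings each page has.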
# import csv files
temp1 <- read.csv("prop1.csv")
temp2 <- read.csv("prop2.csv")
temp3 <- read.csv("prop3.csv")
temp4 <- read.csv("prop4.csv")
# merge into a dataset
houses <- rbind(temp1, temp2, temp3, temp4)
head(houses, 3)
## title address city state postal_code price
## 1 House for sale NA Denver CO 80204 $450,000
## 2 House for sale NA Denver CO 80219 $400,000
## 3 House for sale NA Denver CO 80210 $729,900
## facts.and.features real.estate.provider
## 1 2 bds, 2.0 ba ,1247 sqft RE/MAX Professionals
## 2 2 bds, 1.0 ba ,728 sqft Equity Colorado Real Estate
## 3 3 bds, 2.0 ba ,1922 sqft Key Real Estate Group LLC
## url
## 1 https://www.zillow.com/homedetails/634-King-St-Denver-CO-80204/13340690_zpid/
## 2 https://www.zillow.com/homedetails/2315-S-Hooker-Way-Denver-CO-80219/13380314_zpid/
## 3 https://www.zillow.com/homedetails/2266-S-Humboldt-St-Denver-CO-80210/13376243_zpid/
colnames(houses)
## [1] "title" "address" "city"
## [4] "state" "postal_code" "price"
## [7] "facts.and.features" "real.estate.provider" "url"
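The four `read.csv` calls above could also be collapsed into a single step. This is a minimal sketch; `read_props` is a helper name introduced here for illustration, and it assumes all files share the same columns. Passing `stringsAsFactors = FALSE` keeps text columns as character vectors, which differs from the factor columns shown in the output here.

```r
# Read every file in `files` and stack the rows into one data frame.
# `read_props` is a hypothetical helper, not part of any package.
read_props <- function(files) {
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

# With the four files used above:
# houses <- read_props(sprintf("prop%d.csv", 1:4))
```

Adding a fifth page then only means changing `1:4` to `1:5`.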
# see what form the information is in
head(houses$facts.and.features)
## [1] 2 bds, 2.0 ba ,1247 sqft 2 bds, 1.0 ba ,728 sqft 3 bds, 2.0 ba ,1922 sqft
## [4] 4 bds, 3.0 ba ,2326 sqft 3 bds, 2.0 ba ,1211 sqft 4 bds, 4.0 ba ,3400 sqft
## 159 Levels: 2 bds, 1.0 ba ,1065 sqft ... 6 bds, 4.0 ba ,4187 sqft
# clean price, then split facts.and.features on spaces and pick values by position
houses <- houses %>%
  mutate(facts.and.features = as.character(facts.and.features),
         price = as.numeric(str_replace_all(price, "[^0-9]", "")))
# split once for the whole column instead of once per row
parts <- strsplit(houses$facts.and.features, " ")
houses$beds  <- as.numeric(sapply(parts, `[`, 1))  # "2"
houses$baths <- as.numeric(sapply(parts, `[`, 3))  # "2.0"
# the fifth token looks like ",1247", so drop the comma before converting
houses$sqFt  <- as.numeric(str_replace_all(sapply(parts, `[`, 5), ",", ""))
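Picking values by position only works if every entry has exactly the same token layout. A pattern-based alternative is sketched below; the regular expression is written against the format visible in the output above (`2 bds, 2.0 ba ,1247 sqft`), and allowing a singular `bd` is an assumption about entries I have not seen. `parse_facts` is a helper name made up for this example.

```r
library(stringr)

# Extract beds, baths and square footage by pattern rather than by position.
# Entries that do not match the pattern come back as NA in every column.
parse_facts <- function(x) {
  m <- str_match(x, "([0-9.]+) bds?, ([0-9.]+) ba ,([0-9,]+) sqft")
  data.frame(
    beds  = as.numeric(m[, 2]),
    baths = as.numeric(m[, 3]),
    sqFt  = as.numeric(str_replace_all(m[, 4], ",", ""))
  )
}

parse_facts(c("2 bds, 2.0 ba ,1247 sqft", "3 bds, 2.0 ba ,1922 sqft"))
```

Rows that come back as NA can then be inspected by hand instead of silently shifting values into the wrong column.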
# not sure how informative this is, but here is a plot of house prices across postal codes
houses %>%
  select(beds, baths, sqFt, price, postal_code) %>%
  ggplot(aes(factor(postal_code), price, color = factor(beds))) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "postal code") +
  scale_color_discrete(name = "Number of bedrooms")
summary(lm(price ~ beds + baths + sqFt, houses))
##
## Call:
## lm(formula = price ~ beds + baths + sqFt, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1059639 -173336 -2943 133256 1726320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257996.58 90481.42 2.851 0.00494 **
## beds -74372.00 32750.28 -2.271 0.02452 *
## baths -106517.53 40902.19 -2.604 0.01010 *
## sqFt 485.79 42.51 11.427 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319600 on 156 degrees of freedom
## Multiple R-squared: 0.6429, Adjusted R-squared: 0.636
## F-statistic: 93.61 on 3 and 156 DF, p-value: < 2.2e-16
summary(lm(price ~ beds + baths + sqFt + factor(postal_code), houses))
##
## Call:
## lm(formula = price ~ beds + baths + sqFt + factor(postal_code),
## data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -660476 -112371 -6859 94183 1610265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -78026.11 213100.74 -0.366 0.7148
## beds -79171.70 31340.60 -2.526 0.0127 *
## baths 6313.64 40550.07 0.156 0.8765
## sqFt 397.79 40.52 9.817 <2e-16 ***
## factor(postal_code)80203 769494.05 344107.66 2.236 0.0270 *
## factor(postal_code)80204 124229.00 222965.37 0.557 0.5784
## factor(postal_code)80205 310629.15 221289.39 1.404 0.1627
## factor(postal_code)80206 510379.28 212139.52 2.406 0.0175 *
## factor(postal_code)80207 287392.61 226404.81 1.269 0.2065
## factor(postal_code)80209 468572.14 211071.49 2.220 0.0281 *
## factor(postal_code)80210 419897.68 201904.61 2.080 0.0395 *
## factor(postal_code)80211 429761.56 210195.19 2.045 0.0429 *
## factor(postal_code)80212 337384.38 213430.81 1.581 0.1163
## factor(postal_code)80218 542654.15 270245.36 2.008 0.0467 *
## factor(postal_code)80219 183689.42 204317.00 0.899 0.3703
## factor(postal_code)80220 359757.63 205300.53 1.752 0.0820 .
## factor(postal_code)80221 280384.86 248874.92 1.127 0.2620
## factor(postal_code)80222 190246.96 226030.78 0.842 0.4015
## factor(postal_code)80223 343985.33 249096.67 1.381 0.1696
## factor(postal_code)80230 77290.80 333609.55 0.232 0.8171
## factor(postal_code)80231 -192699.81 222696.17 -0.865 0.3884
## factor(postal_code)80236 206581.55 246598.48 0.838 0.4037
## factor(postal_code)80237 -2731.21 216859.89 -0.013 0.9900
## factor(postal_code)80238 18107.75 216759.36 0.084 0.9335
## factor(postal_code)80239 58199.60 226402.32 0.257 0.7975
## factor(postal_code)80246 439988.90 272515.07 1.615 0.1088
## factor(postal_code)80247 -149401.80 330540.88 -0.452 0.6520
## factor(postal_code)80249 -146036.21 210624.39 -0.693 0.4893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269500 on 132 degrees of freedom
## Multiple R-squared: 0.7853, Adjusted R-squared: 0.7413
## F-statistic: 17.88 on 27 and 132 DF, p-value: < 2.2e-16
People in Denver do not enjoy having extra bedrooms and bathrooms, it seems. Holding the other variables fixed, one more bedroom is associated with a decrease in price of about 74K, and one more bathroom with a decrease of about 107K! They do, however, seem to value extra space: the coefficient on sqFt is positive and statistically significant (p < 0.001). Keep in mind that, with square footage held constant, an extra bedroom or bathroom means the same space is divided into more rooms. Including postal-code fixed effects shows bathrooms may have some value after all (coeff. est. = 6313.64, not statistically significant at the usual levels), but people still don't care for bedrooms (coeff. est. = -79171.70, p = 0.0127). Adding the postal_code dummies reduces the standard errors somewhat and improves R-squared from 64.3% to 78.5% (adjusted R-squared from 63.6% to 74.1%).
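Whether the postal-code dummies matter jointly can be checked more formally than by comparing R-squared values. Since the first model is nested in the second, an F-test via `anova()` is one option. The sketch below wraps both fits in a helper (`compare_models` is a name made up here); I have not run it on the data, so no output is shown.

```r
# Fit the two specifications and test the postal-code dummies jointly.
# `compare_models` is a hypothetical helper; `df` needs the columns
# price, beds, baths, sqFt and postal_code.
compare_models <- function(df) {
  base <- lm(price ~ beds + baths + sqFt, data = df)
  full <- lm(price ~ beds + baths + sqFt + factor(postal_code), data = df)
  list(
    # H0 of the F-test: all postal-code effects are zero
    f_test = anova(base, full),
    adj_r2 = c(base = summary(base)$adj.r.squared,
               full = summary(full)$adj.r.squared)
  )
}

# Usage with the data frame built above:
# compare_models(houses)
```

Adjusted R-squared is returned alongside the test because plain R-squared can only go up when 24 extra dummies are added.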
Some other house characteristics might be of interest, e.g. year built or major renovations; Zillow has information on year of construction too. Other factors might include access to important sites (recreational or business-related), and perhaps crime rates and high-school graduation rates, although all of these are probably captured by the postal code.
A GIS dataset, especially one spanning multiple decades, might be informative as well.