My search is about houses with at least 2 bedrooms and 1.5 bathrooms in Silver Spring, MD. Zillow is the website used. The link to my search is available here.
knitr::opts_chunk$set(echo = TRUE, eval=TRUE, message=FALSE, warning=FALSE, fig.height=4)
necessaryPackages <- c("rvest","tidyverse","dplyr","stringr","ggplot2")
new.packages <- necessaryPackages[
!(necessaryPackages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(necessaryPackages, require, character.only = TRUE)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
My search is spread over 7 pages, which I collate below.
page1 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/?searchQueryState={%22pagination%22:{},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page2 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/2_p/?searchQueryState={%22pagination%22:{%22currentPage%22:2},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page3 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/3_p/?searchQueryState={%22pagination%22:{%22currentPage%22:3},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page4 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/4_p/?searchQueryState={%22pagination%22:{%22currentPage%22:4},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page5 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/5_p/?searchQueryState={%22pagination%22:{%22currentPage%22:5},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page6 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/6_p/?searchQueryState={%22pagination%22:{%22currentPage%22:6},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
page7 <- read_html("https://www.zillow.com/silver-spring-md/houses/2-_beds/1.5-_baths/7_p/?searchQueryState={%22pagination%22:{%22currentPage%22:7},%22usersSearchTerm%22:%22Silver%20Spring,%20MD%22,%22mapBounds%22:{%22west%22:-77.22492031640625,%22east%22:-76.80194668359375,%22south%22:38.953272064835055,%22north%22:39.19208935466388},%22regionSelection%22:[{%22regionId%22:7081,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:2},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false},%22baths%22:{%22min%22:1.5}},%22isListVisible%22:true}")
allpages <- list(page1, page2, page3, page4, page5, page6, page7)
The functions below will be used to scrap the housing search.
scrapPrice <- function(html){
html %>% html_nodes('.list-card-price') %>% html_text() %>% str_trim()
}
scrapDet <- function(html){
html %>% html_nodes('.list-card-details') %>% html_text() %>% str_trim()
}
scrapAddr <- function(html){
html %>% html_nodes('.list-card-addr') %>% html_text() %>% str_trim()
}
I obtain the raw dataset using the functions above.
rawDataset = NULL
for (i in 1:7){
Price <- scrapPrice(allpages[[i]])
Det <- scrapDet(allpages[[i]])
Addr <- scrapAddr(allpages[[i]])
rawDataset <- rbind(rawDataset, data.frame(cbind(Price, Det, Addr)))
}
Here is a description of the raw data.
str(rawDataset)
## 'data.frame': 18 obs. of 3 variables:
## $ Price: Factor w/ 18 levels "$175,000","$339,000",..: 6 5 1 3 4 2 9 7 8 16 ...
## $ Det : Factor w/ 18 levels "2 bds2 ba1,273 sqft- Condo for sale",..: 4 6 1 5 2 3 8 7 9 14 ...
## $ Addr : Factor w/ 18 levels "10301 Naglee Rd, Silver Spring, MD 20903",..: 3 5 8 2 4 9 1 6 7 15 ...
I clean the data using the code given in the assignment, but I exclude incomplete observations. The clean dataset characterizes each house with two count variables (number of bedrooms and number of bathrooms), two continuous variables (price and square footage), and the address of the house.
cleanDataset <- rawDataset %>%
mutate(bedrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=bds)")))) %>%
mutate(bathrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=ba)")))) %>%
mutate(sqft = str_trim(str_extract(Det, "[\\d ,]*(?=sqft)"))) %>%
mutate(sqft = as.numeric(str_replace(sqft,",",""))) %>%
mutate(price = as.numeric(str_replace_all(Price,"[^0-9]*","")))
cleanDataset <- na.omit(cleanDataset)
str(cleanDataset)
## 'data.frame': 18 obs. of 7 variables:
## $ Price : Factor w/ 18 levels "$175,000","$339,000",..: 6 5 1 3 4 2 9 7 8 16 ...
## $ Det : Factor w/ 18 levels "2 bds2 ba1,273 sqft- Condo for sale",..: 4 6 1 5 2 3 8 7 9 14 ...
## $ Addr : Factor w/ 18 levels "10301 Naglee Rd, Silver Spring, MD 20903",..: 3 5 8 2 4 9 1 6 7 15 ...
## $ bedrooms : int 3 4 2 4 3 3 4 4 5 4 ...
## $ bathrooms: int 2 3 2 2 2 2 4 3 3 3 ...
## $ sqft : num 2276 1458 1273 1914 1296 ...
## $ price : num 480000 410000 175000 365000 399900 ...
summary(cleanDataset)
## Price Det
## $175,000: 1 2 bds2 ba1,273 sqft- Condo for sale: 1
## $339,000: 1 3 bds2 ba1,296 sqft- House for sale: 1
## $365,000: 1 3 bds2 ba1,517 sqft- Condo for sale: 1
## $399,900: 1 3 bds2 ba2,276 sqft- House for sale: 1
## $410,000: 1 4 bds2 ba1,914 sqft- House for sale: 1
## $480,000: 1 4 bds3 ba1,458 sqft- House for sale: 1
## (Other) :12 (Other) :12
## Addr bedrooms
## 10301 Naglee Rd, Silver Spring, MD 20903 : 1 Min. :2.000
## 1110 Ruatan St, Silver Spring, MD 20903 : 1 1st Qu.:3.000
## 11720 College View Dr, Silver Spring, MD 20902: 1 Median :4.000
## 11807 Goodloe Rd, Silver Spring, MD 20906 : 1 Mean :3.611
## 11815 Dewey Rd, Silver Spring, MD 20906 : 1 3rd Qu.:4.000
## 1253 Cresthaven Dr, Silver Spring, MD 20903 : 1 Max. :6.000
## (Other) :12
## bathrooms sqft price
## Min. :2.000 Min. :1079 Min. : 175000
## 1st Qu.:2.000 1st Qu.:1279 1st Qu.: 276000
## Median :2.000 Median :1529 Median : 445000
## Mean :2.722 Mean :1736 Mean : 513089
## 3rd Qu.:3.000 3rd Qu.:2178 3rd Qu.: 592474
## Max. :5.000 Max. :2752 Max. :1300000
##
To explore a linear relationship between price and square footage, I will plot square footage against price. However, I’m also interested in having some bedrooms without bathrooms for a couple reasons: (i) office room, (ii) storage room, and (iii) bedrooms sharing bathrooms.
# diff: number of bedrooms minus number of bathrooms
cleanDataset$bedBath <- cleanDataset$bedrooms-cleanDataset$bathrooms
# new variable based on diff
cleanDataset$bedBathCategory <- "NA"
cleanDataset$bedBathCategory[cleanDataset$bedBath<0] <- "fewer beds than baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath==0] <- "same number of beds and baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath==1] <- "one bed without baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath>=2] <- "two or more beds without baths"
cleanDataset$bedBathCategory = as.factor(cleanDataset$bedBathCategory)
The plot shows two main visual results: (1) most houses are in the quadrant spanning square footage <= 5,000 sqft and price <= $750,000; (2) in this quadrant, the relationship between price and square footage is non-linear, rather logarithmic or quadratic; (3) there is a concentration of most houses with at least one bedroom without bathroom in this quadrant.
graph1 <- ggplot(data = cleanDataset, aes(x = sqft, y = price, color = bedBathCategory)) +
geom_point() +
labs(y = "Price ($)",
x = "Square footage (squared feet)",
title = "Square footage vs. Price of houses in Silver Spring, MD")
graph1
To explore a non-linear relationship between price and square footage in the regressions, I will create the squared value of square footage, as well as the log value of square footage and price. I will also create one dummy variable capturing houses with at least one bedroom without bathroom, which can be interacted with the number of bedrooms; another dummy variable will define the quadrant that excludes the outliers.
# squared sqft
cleanDataset$squared_sqft <- cleanDataset$sqft*cleanDataset$sqft
# log variables
cleanDataset$log_sqft <- log(cleanDataset$sqft)
cleanDataset$log_price <- log(cleanDataset$price)
# dummy variables
cleanDataset$nonOutlier <- ifelse(cleanDataset$sqft<=5000 & cleanDataset$price<=750000,1,0)
#cleanDataset$nonOutlier = as.factor(cleanDataset$nonOutlier)
cleanDataset$extraBedWithoutBath <- ifelse(cleanDataset$bedBath>=1,1,0)
#cleanDataset$extraBedWithoutBath = as.factor(cleanDataset$extraBedWithoutBath)
# interaction variable
cleanDataset$interactionvar <- cleanDataset$bedrooms * cleanDataset$extraBedWithoutBath
I estimated four models: linear, log, quadratic (with interaction), and quadratic (without interaction). Not surprisingly after the visual analysis, the regression results indicate a better fit (highest R-squared) with the quadratic models.
lmmodel <- lm(price ~ sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(lmmodel)
##
## Call:
## lm(formula = price ~ sqft + bedrooms + extraBedWithoutBath +
## interactionvar + nonOutlier, data = cleanDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107543 -42014 9856 37381 118875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 357022.85 130337.53 2.739 0.017957 *
## sqft 69.02 59.27 1.164 0.266861
## bedrooms 183755.66 41704.44 4.406 0.000856 ***
## extraBedWithoutBath 295589.07 128802.53 2.295 0.040570 *
## interactionvar -117230.42 37517.91 -3.125 0.008779 **
## nonOutlier -579221.29 70097.83 -8.263 2.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76790 on 12 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9387
## F-statistic: 53.08 on 5 and 12 DF, p-value: 9.076e-08
logmodel <- lm(price ~ log_sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(logmodel)
##
## Call:
## lm(formula = price ~ log_sqft + bedrooms + extraBedWithoutBath +
## interactionvar + nonOutlier, data = cleanDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111976 -42170 10036 38879 118612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -459369 710337 -0.647 0.53001
## log_sqft 129794 106302 1.221 0.24554
## bedrooms 176957 44539 3.973 0.00185 **
## extraBedWithoutBath 277420 131077 2.116 0.05589 .
## interactionvar -111386 38673 -2.880 0.01383 *
## nonOutlier -584204 68600 -8.516 1.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76400 on 12 degrees of freedom
## Multiple R-squared: 0.9572, Adjusted R-squared: 0.9393
## F-statistic: 53.64 on 5 and 12 DF, p-value: 8.55e-08
quadintmodel <- lm(price ~ squared_sqft + sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(quadintmodel)
##
## Call:
## lm(formula = price ~ squared_sqft + sqft + bedrooms + extraBedWithoutBath +
## interactionvar + nonOutlier, data = cleanDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114781 -42242 12950 39793 118484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.936e+05 2.633e+05 1.115 0.28865
## squared_sqft -2.897e-02 1.031e-01 -0.281 0.78399
## sqft 1.802e+02 4.005e+02 0.450 0.66153
## bedrooms 1.750e+05 5.337e+04 3.280 0.00734 **
## extraBedWithoutBath 2.706e+05 1.609e+05 1.682 0.12077
## interactionvar -1.089e+05 4.899e+04 -2.224 0.04807 *
## nonOutlier -5.876e+05 7.884e+04 -7.453 1.27e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79910 on 11 degrees of freedom
## Multiple R-squared: 0.9571, Adjusted R-squared: 0.9336
## F-statistic: 40.85 on 6 and 11 DF, p-value: 6.855e-07
quadmodel <- lm(price ~ squared_sqft + sqft + bedrooms + bathrooms + nonOutlier, cleanDataset)
summary(quadmodel)
##
## Call:
## lm(formula = price ~ squared_sqft + sqft + bedrooms + bathrooms +
## nonOutlier, data = cleanDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93482 -46162 7126 49742 81675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.882e+04 2.197e+05 0.222 0.82787
## squared_sqft -9.545e-02 7.614e-02 -1.254 0.23388
## sqft 4.510e+02 2.910e+02 1.550 0.14715
## bedrooms 5.090e+04 2.216e+04 2.296 0.04046 *
## bathrooms 9.266e+04 2.766e+04 3.350 0.00578 **
## nonOutlier -4.928e+05 8.107e+04 -6.078 5.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70310 on 12 degrees of freedom
## Multiple R-squared: 0.9637, Adjusted R-squared: 0.9486
## F-statistic: 63.77 on 5 and 12 DF, p-value: 3.183e-08
The quadratic model (without interaction) is easier to interpret than its counterpart with interaction. Based on that last model, the marginal effects of the attributes “bedroom” and “bathroom” of a house are either insignificant or significant at the 10% level of confidence (depending on the search data) but both positive, with a magnitude approximately lower than $15,000 for “bedroom” and than $25,000 for “bathroom” (depending on the search data). The marginal effect of not taking an outlying house is significant at the 1% level of confidence and largely negative.
quadmodel$coefficients
## (Intercept) squared_sqft sqft bedrooms bathrooms
## 4.881508e+04 -9.544671e-02 4.509749e+02 5.089963e+04 9.265651e+04
## nonOutlier
## -4.927520e+05
Given these results, I will ideally choose a house with the following characteristcs:
falling within the range of square footage <= 5,000 sqft and price <= $750,000 ;
having just the minimum of bathrooms needed ;
having one or two more bedrooms than bathrooms ;
and having the minimum square footage among houses with the above three characteristics.
Four other variables I would like to consider further for refining my housing search could be: (1) the median or average income in the house neighborhood; (2) the level of safety in the house neighborhood; (3) the availability of medium to large grocery stores in the in the house neighborhood; and (4) my household income and job tenure (critical for affordability).
Note: This analysis could have been better done if my selection of independent variabes was based on the literature on hedonic price for residential buildings in Maryland, in addition to an analysis of the residuals from the retained model.