We are using Zillow for house listings in Athens, Ga. We open the zillow website in our browser and search for any 1+ bedrooms in Athens, GA. This was the url of my search.
This section includes the Bonus Task #1 where I have presented a general code that can be used for different locations and different number of search pages. The results page include 334 results spread over 9 pages. This section below calculates the number of search pages and uses that count in the loop below to construct our dataframe via web-scraping.
# This set of code tells me the number of search pages to make the code general as possible
zillow_url <- "https://www.zillow.com/athens-ga/houses/1-_beds/1.0-_baths/?searchQueryState={%22pagination%22:{},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.64916852832032,%22east%22:-83.15341047167969,%22south%22:33.77025301136187,%22north%22:34.14580381097827},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22baths%22:{%22min%22:1},%22sort%22:{%22value%22:%22globalrelevanceex%22},%22land%22:{%22value%22:false},%22tow%22:{%22value%22:false},%22manu%22:{%22value%22:false}},%22isListVisible%22:true}"
numberoftiles =40 # This is the number of tiles of listings that show up on Zillow.
zillow_pg <- read_html(zillow_url)
# count the total results and set the page count
zillow_cnt <- zillow_pg %>% html_nodes('.result-count') %>% html_text() %>% str_remove(' results') %>% as.numeric()
zillow_pg_cnt = ceiling(zillow_cnt /numberoftiles)
## Now to construct a dataframe to loop over the number of search pages.
res_all <- NULL
for (i in 1:zillow_pg_cnt) {
pg <- read_html(str_c('https://www.zillow.com/athens-ga/houses/1-_beds/1.0-_baths/',i,'_p','/?searchQueryState={%22pagination%22:{%22currentPage%22:',i,'},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.64916852832032,%22east%22:-83.15341047167969,%22south%22:33.77025301136187,%22north%22:34.14580381097827},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22baths%22:{%22min%22:1},%22sort%22:{%22value%22:%22globalrelevanceex%22},%22land%22:{%22value%22:false},%22tow%22:{%22value%22:false},%22manu%22:{%22value%22:false}},%22isListVisible%22:true}'))
res_pg <- tibble(
address= pg %>% html_nodes(".list-card-info a") %>% html_text(),
pricelisted = pg %>% html_nodes(".list-card-price") %>% html_text(),
bedrooms = pg %>% html_nodes(".list-card-details li:nth-child(1)") %>%
html_text() %>% str_remove(' bd.*') %>% as.numeric(),
bathrooms = pg %>% html_nodes(".list-card-details li:nth-child(2)") %>%
html_text() %>% str_remove(' ba') %>% as.numeric(),
squarefoot = pg %>% html_nodes(".list-card-details li:nth-child(3)") %>%
html_text() %>% str_remove(' sqft'),
typeofsale = pg %>% html_nodes(".list-card-type") %>%
html_text(),
zillowcomments = pg %>% html_nodes(".list-card-top") %>%
html_text(),
zillow_link = pg %>% html_nodes(".list-card-info a") %>% html_attr('href')
)
res_all <- res_all %>% bind_rows(res_pg)
}
# Cleaning the file - Zillow_DF
zillow_df <- res_all %>%
mutate(pricelisted = as.numeric(str_replace_all(pricelisted,"[^0-9]*",""))) %>%
mutate(squarefoot = as.numeric(str_replace(squarefoot,",","")))
## Warning: NAs introduced by coercion
# create new dataset without missing data
zillow_df <- na.omit(zillow_df)
Now that data frame is ready we can see what it looks like. After cleaning the data, we have 308 observations. Below shows the how my dataframe looks like for the last few observations.
# 1) Number of Observations :
nrow(zillow_df)
## [1] 308
zillow_df %>% select(1:5) %>% tail()
## # A tibble: 6 x 5
## address pricelisted bedrooms bathrooms squarefoot
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 435 Brookwood Dr, Athens, GA 30605 240229 4 3 2039
## 2 125 Putters Dr, Athens, GA 30607 262504 3 3 2569
## 3 170 Cook Cir, Athens, GA 30601 122791 3 1 925
## 4 455 Sandstone Dr, Athens, GA 30605 269410 4 3 2464
## 5 102 Kara Dr, Athens, GA 30606 233593 4 3 2548
## 6 107 Whitehead Ter E, Athens, GA 306~ 162982 3 2 1358
attach(zillow_df) ## For easy access
## Outlier in Price
boxplot(pricelisted, main="Boxplot of House Prices",
ylab = "House Price Listed ($)", yaxt = "n",
col = "gold", medcol = "red", boxlty = 0, axes=FALSE,
whisklty = 1, staplelwd = 4, outpch = 8, outcex = 1)
axis(2, at = seq(0, max(pricelisted), 70000), las = 2, cex.axis=0.5)
## Relationship between Price and Squarefoot.
plot(pricelisted ~ squarefoot,
pch=15, col = rgb((1:7)/7, 0, 0)[as.factor(bedrooms)],
yaxt = "n", xaxt = "n", axes = FALSE, main="",
ylab="House Prices Listed ($)",
xlab = "Square Foot Of Houses")
axis(2, at = seq(0, max(pricelisted), 125000), las = 2, cex.axis=0.5)
axis(1,at = seq(300, max(squarefoot), 500), las=2, cex.axis = 0.6 )
legend("bottomright", legend = paste("Bedrooms", 1:7),
col = rgb((1:7)/7, 0, 0), pch = 15)
title("Here is a simple plot of Square footage vs. price \ngrouped by number of bedrooms", cex.main = 1, font.main= 2, col.main=rgb(1, 0, 0) )
olsmodel <- lm(pricelisted ~ bedrooms + bathrooms + squarefoot)
summary(olsmodel)
##
## Call:
## lm(formula = pricelisted ~ bedrooms + bathrooms + squarefoot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -478936 -86809 -26342 46713 833992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38741.6 31083.0 1.246 0.213580
## bedrooms -41459.3 12676.9 -3.270 0.001197 **
## bathrooms 45238.8 11982.8 3.775 0.000192 ***
## squarefoot 147.4 12.6 11.699 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 156700 on 304 degrees of freedom
## Multiple R-squared: 0.6486, Adjusted R-squared: 0.6451
## F-statistic: 187 on 3 and 304 DF, p-value: < 2.2e-16
olsmodel$coefficients
## (Intercept) bedrooms bathrooms squarefoot
## 38741.6281 -41459.2630 45238.7803 147.4233
In the above model results, we see that having an additional bedroom in a home decreases the price of a house by -4.145926310^{4} dollars, ceteris paribus. It is statistically significant at 1% level of significance.
Comment : Since bedrooms is a categorical variable treating it as continuous is not capturing the non-linearity in the bedroom capacity and price of houses.
In the above model results, we see that having an additional bathroom in a home increases the price of a house by 4.52387810^{4} dollars, ceteris paribus. It is statistically significant at 1% level of significance.
Comment : Since bathroom is a categorical variable treating it as continuous is not capturing the non-linearity in the bathroom capacity and price of houses.
The correlational relation observed in the above plot is found to be evident in the OLS results above. In the above model results, we see that increasing the squarefoot by 1 unit increases the price of a house by 147.4233309 , ceteris paribus. It is statistically significant at 1% level of significance.
In this section we take the log of prices and log of squarefoot to control for extreme distribution of the variables. We find the same relation as above in the un-transformed variable. Here the coefficient against “log(squarefoot)” is the price elasticity of increasing squarefoot by 1 unit. All variables are statistically significant at 1% level of significance.
## In Log -form to capture coefficient as elasticity
olslogmodel <- lm(log(pricelisted) ~ bedrooms + bathrooms + log(squarefoot))
summary(olslogmodel)
##
## Call:
## lm(formula = log(pricelisted) ~ bedrooms + bathrooms + log(squarefoot))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5364 -0.2361 -0.0574 0.2107 1.2888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.23237 0.52620 11.844 < 2e-16 ***
## bedrooms -0.10228 0.03426 -2.985 0.00307 **
## bathrooms 0.14672 0.02921 5.022 8.72e-07 ***
## log(squarefoot) 0.82703 0.08238 10.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3999 on 304 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6242
## F-statistic: 171 on 3 and 304 DF, p-value: < 2.2e-16
olslogmodel$coefficients
## (Intercept) bedrooms bathrooms log(squarefoot)
## 6.2323660 -0.1022793 0.1467194 0.8270350
Find the r squared using the summary(olsmodel) command. How good is the fit?
To improve the fit, a couple of things could be done:
Distance To the University of Georgia campus. This can be done by extracting the latitudes and longitudes of each home and calculate the distance to UGA all form Zillow html source code. The hypothesis is that houses closer to UGA would be valued higher.
How long the house has been listed? Another feature could be to gather the information on how long a house has been on the market on Zillow. The hypothesis is that longer the listing listed on Zillow the lower the price as that could indicate some trouble selling the house.
Longitudinal Analysis: The model can be better improved by constructing a panel data with time varying causal variable like pollution particulate count in each county/zipcode/district/census tract, etc. Panel data will help to control for unobserved heterogeneity. the pollution data can be extracted from EPA’s AQS (Air Quality System) database
Fancier: Image Analysis: extract from the images posted on Zillow whether the house shows a garden, porch, trees, flowers, etc.
## This extracts the numeric terms from days on Zillow comments.
zillow_df$daysonZillow <- as.numeric(gsub("([0-9]+).*$", "\\1", zillow_df$zillowcomments))
## Warning: NAs introduced by coercion
##
zillow_df.new <- na.omit(zillow_df)
## We have now 222 houses that have information on how many days on Zillow
attach(zillow_df.new)
## The following objects are masked from zillow_df:
##
## address, bathrooms, bedrooms, pricelisted, squarefoot, typeofsale,
## zillow_link, zillowcomments
##Running OLS With this new variable and the new Zillow dataset.
olsmodel.new <- lm(pricelisted ~ bedrooms + bathrooms + squarefoot+ daysonZillow)
summary(olsmodel.new)
##
## Call:
## lm(formula = pricelisted ~ bedrooms + bathrooms + squarefoot +
## daysonZillow)
##
## Residuals:
## Min 1Q Median 3Q Max
## -392605 -88805 -37775 34035 817543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18170.87 37621.34 0.483 0.62958
## bedrooms -25534.82 15849.12 -1.611 0.10859
## bathrooms 49015.90 15625.87 3.137 0.00194 **
## squarefoot 119.83 16.94 7.072 2.03e-11 ***
## daysonZillow 237.53 102.04 2.328 0.02083 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161400 on 219 degrees of freedom
## Multiple R-squared: 0.5605, Adjusted R-squared: 0.5525
## F-statistic: 69.83 on 4 and 219 DF, p-value: < 2.2e-16
olsmodel.new$coefficients
## (Intercept) bedrooms bathrooms squarefoot daysonZillow
## 18170.8734 -25534.8181 49015.9009 119.8271 237.5253
Here I am sing the number of days house was listed on Zillow. We have 224 Houses that have information on “number of days on Zillow”.
COMMENTS:
Here, the variable “days on Zillow” is statistically insignificant.
Also, the sign is contrary to the hypothesis that longer the houses are on the market the lower the prices as it would indicate a house having troubke selling. Here, we see a positive sign.
There is obviously an endogeneity concern with “days on zillow”.