This is an attempt to scrape the Athens, GA housing market from Zillow to look at the relationship between monthly rent and the square footage of a home. To begin, I extract the two URLs that I will need. There is a way to capture the remaining page URLs from the first page itself. However, Zillow has updated its pages to use “lazy loading”, and the only solution I have been able to find is the Selenium package for Python. There is an R counterpart, “RSelenium”, but there appear to be issues with using it on Windows, and I currently do not have Linux dual-booted on my laptop.
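The long query strings in the two URLs that follow are percent-encoded JSON. As a small sketch, base R's `URLdecode` makes that structure readable. The shortened string here is an illustrative fragment assembled from the start of the second URL (with a closing brace added); it is an assumption for demonstration, not a working Zillow query.

```r
# Illustrative fragment only: the opening of the second URL's searchQueryState,
# closed off with an extra "%7D" so that it forms a complete JSON object.
encoded <- paste0(
  "%7B%22pagination%22%3A%7B%22currentPage%22%3A2%7D",
  "%2C%22usersSearchTerm%22%3A%22Athens%2C%20GA%22%7D"
)

# URLdecode() turns the percent escapes back into plain characters
decoded <- utils::URLdecode(encoded)
cat(decoded, "\n")
# {"pagination":{"currentPage":2},"usersSearchTerm":"Athens, GA"}
```

Read this way, the two URLs differ only in the `2_p/` path segment and the `currentPage` field; the map bounds and filters are identical.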
url1<-'https://www.zillow.com/athens-ga/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Athens%2C%20GA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-83.67114168457032%2C%22east%22%3A-83.13143831542969%2C%22south%22%3A33.78280947096356%2C%22north%22%3A34.13330081439908%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A23534%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22apco%22%3A%7B%22value%22%3Afalse%7D%2C%22apa%22%3A%7B%22value%22%3Afalse%7D%2C%22con%22%3A%7B%22value%22%3Afalse%7D%2C%22tow%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D'
url2<-'https://www.zillow.com/athens-ga/rentals/2_p/?searchQueryState=%7B%22pagination%22%3A%7B%22currentPage%22%3A2%7D%2C%22usersSearchTerm%22%3A%22Athens%2C%20GA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-83.67114168457032%2C%22east%22%3A-83.13143831542969%2C%22south%22%3A33.78280947096356%2C%22north%22%3A34.13330081439908%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A23534%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22apco%22%3A%7B%22value%22%3Afalse%7D%2C%22apa%22%3A%7B%22value%22%3Afalse%7D%2C%22con%22%3A%7B%22value%22%3Afalse%7D%2C%22tow%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D'
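# Packages used below: rvest (scraping), dplyr and stringr (cleaning),
# ggplot2 (plots), stargazer (regression table)
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(stargazer)
# Download both result pages and keep them in a list
zillowhtml_1 <- read_html(url1)
zillowhtml_2 <- read_html(url2)
html_pages <- list(zillowhtml_1, zillowhtml_2)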
# Each helper pulls the text of one field from every listing card on a page
get_price <- function(html) {
  html %>%
    html_nodes('.list-card-price') %>%
    html_text()
}
get_details <- function(html) {
  html %>%
    html_nodes('.list-card-details') %>%
    html_text()
}
get_address <- function(html) {
  html %>%
    html_nodes('.list-card-addr') %>%
    html_text()
}
# Collect the listings from every page into one data frame
houseData <- NULL
for (i in seq_along(html_pages)) {
  price <- get_price(html_pages[[i]])
  details <- get_details(html_pages[[i]])
  address <- get_address(html_pages[[i]])
  houseData <- rbind(houseData,
                     data.frame(cbind(price, details, address), stringsAsFactors = FALSE))
}
# Parse bedrooms, bathrooms, and square footage out of the details text
houseData <- houseData %>%
  mutate(
    # "bds?" also matches the singular "bd"
    bedrooms = as.integer(str_extract(details, "\\d+(?=\\s*bds?)")),
    # allow a decimal point so that a value such as 1.5 is not read as 5
    bathrooms = as.numeric(str_extract(details, "[\\d.]+(?=\\s*ba)")),
    # drop thousands separators before converting
    sqft = as.numeric(str_remove_all(str_extract(details, "[\\d,]+(?=\\s*sqft)"), ",")),
    # keep only the digits of the price
    price = as.numeric(str_replace_all(price, "[^0-9]", ""))
  )
# Copy with bedrooms and bathrooms as factors, used to color the first plot
houseData2 <- houseData
houseData2$bedrooms <- as.factor(houseData$bedrooms)
houseData2$bathrooms <- as.factor(houseData$bathrooms)
Since lazy loading prevents the entire page from loading from the URL alone, I attempted to use RSelenium to scroll the page and extract the data for me. I ran into multiple errors and eventually found the creator of Selenium stating that there are issues with RSelenium on Windows. This restricts the data to the 9 results that load on each page, 18 observations in total. In addition, I found that URLs copied from Firefox led to fewer errors in extracting the price data. When I copied them from Chrome, several listings had a seemingly random number as the price, and the actual price ended up in the “details” variable.

From the following two graphs, we can see that monthly rent increases roughly linearly with square footage. I expected a more logarithmic relationship. That may still be the case, but unfortunately I was unable to obtain more observations.
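With more observations, one way to check the linear versus logarithmic question would be to fit both specifications and compare them. The following is only a sketch: `compare_fits` is a hypothetical helper, and the `toy` data frame is synthetic and generated purely for illustration. It does not contain Zillow listings and says nothing about the Athens market.

```r
# Hypothetical helper (not part of the original analysis): fit rent on sqft
# and on log(sqft), then report R-squared and AIC for each specification.
compare_fits <- function(df) {
  linear <- lm(price ~ sqft, data = df)
  logfit <- lm(price ~ log(sqft), data = df)
  data.frame(
    model     = c("linear", "log(sqft)"),
    r_squared = c(summary(linear)$r.squared, summary(logfit)$r.squared),
    aic       = c(AIC(linear), AIC(logfit))
  )
}

# Synthetic data for illustration only; these are NOT scraped listings.
set.seed(1)
toy <- data.frame(sqft = seq(500, 3000, length.out = 18))
toy$price <- 300 + 0.9 * toy$sqft + rnorm(18, sd = 150)

compare_fits(toy)
```

Because both models share the same response variable, their AIC values are directly comparable; the lower AIC indicates the better-fitting specification.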
# Scatter plot of rent against size, colored by number of bedrooms
plot1 <- ggplot(data = houseData2, aes(x = sqft, y = price, color = bedrooms)) +
  scale_color_brewer(palette = "Set2") +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "right") +
  labs(y = "Price in $",
       x = "Size in square feet",
       color = NULL,
       title = "Monthly rent for houses in Athens, GA")
plot1
# regression equation
fit <- lm(price ~ sqft + bedrooms + bathrooms, data = houseData)
stargazer(fit, type='html')
| | Dependent variable: price |
|---|---|
| sqft | 0.872*** (0.278) |
| bedrooms | 113.861 (167.055) |
| bathrooms | 35.419 (79.920) |
| Constant | 334.942 (247.729) |
| Observations | 18 |
| R² | 0.840 |
| Adjusted R² | 0.806 |
| Residual Std. Error | 290.443 (df = 14) |
| F Statistic | 24.476*** (df = 3; 14) |
| Note: | Standard errors in parentheses. *p<0.1; **p<0.05; ***p<0.01 |
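Holding the other variables constant, an additional bedroom or bathroom is associated with an average increase in rent of $113.86 and $35.42 per month, respectively, although neither estimate is statistically significant. Only square footage is significant, at about $0.87 per additional square foot. The goodness of fit, or R², is 0.84, meaning the model explains 84% of the variation in rent. Variables such as neighborhood, distance to schools, school quality, and, especially in Athens, distance to the University and downtown would likely account for much of the remaining variation.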
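To make the coefficients concrete, here is a small worked example that plugs the estimates from the table into the regression equation. The listing itself (1,000 sqft, 2 bedrooms, 1 bathroom) is a hypothetical chosen for illustration, and with only 18 observations the result should be read as a rough figure.

```r
# Coefficients copied from the regression table above
coefs <- c(intercept = 334.942, sqft = 0.872, bedrooms = 113.861, bathrooms = 35.419)

# Hypothetical listing (an assumption for illustration): 1,000 sqft, 2 bd, 1 ba
listing <- c(intercept = 1, sqft = 1000, bedrooms = 2, bathrooms = 1)

# Predicted rent = intercept + 0.872 * sqft + 113.861 * bedrooms + 35.419 * bathrooms
predicted_rent <- sum(coefs * listing)
print(predicted_rent)
# 1470.083
```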
# Treat bedrooms and bathrooms as categories for plotting
houseData$bedrooms <- as.factor(houseData$bedrooms)
houseData$bathrooms <- as.factor(houseData$bathrooms)
# Same scatter plot with a fitted regression line added
plot2 <- ggplot(data = houseData, aes(x = sqft, y = price, color = bedrooms)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "right") +
  labs(y = "Price in $",
       x = "Size in square feet",
       color = NULL,
       title = "Monthly rent for houses and apartments in Athens, GA") +
  geom_smooth(method = "lm", color = "turquoise")
plot2
## `geom_smooth()` using formula 'y ~ x'