I pulled data from Zillow for houses in Bozeman, Montana. (My sister is considering moving there.)
necessaryPackages <- c("rvest", "tidyverse")
new.packages <- necessaryPackages[
!(necessaryPackages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(necessaryPackages, require, character.only = TRUE)
## Loading required package: rvest
## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::pluck() masks rvest::pluck()
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
library(dplyr)
#making the links
links <-c()
#split via the 'current page' number
beg <- "https://www.zillow.com/bozeman-mt/houses/?searchQueryState={%22pagination%22:{%22currentPage%22:"
end_link <-"},%22usersSearchTerm%22:%22boseman%20montana%22,%22mapBounds%22:{%22west%22:-111.21946244603778,%22east%22:-110.85554032689716,%22south%22:45.59660806436434,%22north%22:45.809043676801664},%22mapZoom%22:11,%22regionSelection%22:[{%22regionId%22:44281,%22regionType%22:6}],%22isMapVisible%22:true,%22filterState%22:{%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isMultiFamily%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isCondo%22:{%22value%22:false},%22isManufactured%22:{%22value%22:false},%22isApartment%22:{%22value%22:false}},%22isListVisible%22:true}"
#put the links together; paste0() is vectorized over the page numbers 1 to 6
links <- paste0(beg, 1:6, end_link)
p_vect <- c()
detail_vect <- c()
address_vect <- c()
This worked for getting the right links together in one vector, since there are six pages of results from the search I did. I also wrote a series of functions to pull the pieces of data I wanted from the webpages:
get_price <- function(place){
place%>%
# The relevant tag
html_nodes('.list-card-price')%>%
html_text()%>%
str_trim()
}
get_address <- function(place){
place%>%
# The relevant tag
html_nodes('.list-card-addr')%>%
html_text()%>%
str_trim()
}
get_details <- function(place){
place%>%
# The relevant tag
html_nodes('.list-card-details')%>%
html_text()%>%
str_trim()
}
However, when I tried to loop through the pages to pull the results, I kept getting a number of prices that didn’t match the number of houses. The prices were also in the wrong order when I double-checked. I ended up having to hard-code the data gathering for each page and then combine everything at the end. That code is as follows:
#hard code for the different links and to get the data.
boseman_1 <- paste(beg, 1, end_link, sep = "")
boseman_2 <- paste(beg, 2, end_link, sep = "")
boseman_3 <- paste(beg, 3, end_link, sep = "")
boseman_4 <- paste(beg, 4, end_link, sep = "")
boseman_5 <- paste(beg, 5, end_link, sep = "")
boseman_6 <- paste(beg, 6, end_link, sep = "")
#hard code the read for the HTML
Bose_1 <- read_html(boseman_1)
Bose_2 <- read_html(boseman_2)
Bose_3 <- read_html(boseman_3)
Bose_4 <- read_html(boseman_4)
Bose_5 <- read_html(boseman_5)
Bose_6 <- read_html(boseman_6)
#pulling prices
p_1 <- get_price(Bose_1)
p_2 <- get_price(Bose_2)
p_3 <- get_price(Bose_3)
p_4 <- get_price(Bose_4)
p_5 <- get_price(Bose_5)
p_6 <- get_price(Bose_6)
#concat. all price into one vect
p_vect_2 <- c(p_1, p_2, p_3, p_4, p_5, p_6)
#hard code for details
d_1 <- get_details(Bose_1)
d_2 <- get_details(Bose_2)
d_3 <- get_details(Bose_3)
d_4 <- get_details(Bose_4)
d_5 <- get_details(Bose_5)
d_6 <- get_details(Bose_6)
#concat. into one vect
detail_vect_2 <- c(d_1, d_2, d_3, d_4, d_5, d_6)
a_1 <- get_address(Bose_1)
a_2 <- get_address(Bose_2)
a_3 <- get_address(Bose_3)
a_4 <- get_address(Bose_4)
a_5 <- get_address(Bose_5)
a_6 <- get_address(Bose_6)
address_vect_2 <- c(a_1, a_2, a_3, a_4, a_5, a_6)
#compiling all data into one data frame
Boseman_data <- data.frame(p_vect_2, detail_vect_2, address_vect_2)
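A mismatch between the number of prices and the number of houses is typical when each field is pulled with its own page-wide selector: a listing with no price simply drops out of the price vector, and everything after it shifts. Here is a minimal sketch of one way around this, assuming each listing sits in its own container element. The `.list-card-info` class, the `get_cards()` helper, and the inline HTML are made up for illustration and are not taken from Zillow's actual markup. The idea is to select the containers first and then call `html_node()` (singular) on them, which returns one result per container and `NA` where a field is missing.

```r
library(rvest)

# Hypothetical stand-in for one results page; the second card has no price.
page <- read_html(paste0(
  '<html><body>',
  '<div class="list-card-info"><address class="list-card-addr">1 Main St</address>',
  '<div class="list-card-price">$450,000</div>',
  '<ul class="list-card-details">3 bds2 ba1,850 sqft</ul></div>',
  '<div class="list-card-info"><address class="list-card-addr">2 Oak Rd</address>',
  '<ul class="list-card-details">4 bds3 ba2,400 sqft</ul></div>',
  '<div class="list-card-info"><address class="list-card-addr">3 Pine Ave</address>',
  '<div class="list-card-price">$615,000</div>',
  '<ul class="list-card-details">2 bds1 ba900 sqft</ul></div>',
  '</body></html>'
))

# Hypothetical helper: one row per card, with NA for any missing field.
get_cards <- function(place) {
  cards <- html_nodes(place, ".list-card-info")  # assumed container class
  data.frame(
    price   = html_text(html_node(cards, ".list-card-price"), trim = TRUE),
    details = html_text(html_node(cards, ".list-card-details"), trim = TRUE),
    address = html_text(html_node(cards, ".list-card-addr"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

cards <- get_cards(page)
cards
```

Because every column has exactly one entry per card, the three fields stay aligned even when a listing has no price, and the per-page results could be stacked with `rbind()` or `dplyr::bind_rows()`.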
I ended up using the data cleaning methods from the example:
#data cleaning
Boseman_data <- Boseman_data%>%
mutate(bedrooms = as.integer(str_trim(str_extract(detail_vect_2, "[\\d ]*(?=bds)")))) %>%
mutate(bathrooms = as.integer(str_trim(str_extract(detail_vect_2, "[\\d ]*(?=ba)")))) %>%
mutate(sqft = str_trim(str_extract(detail_vect_2, "[\\d ,]*(?=sqft)"))) %>%
mutate(sqft = as.numeric(str_replace(sqft,",",""))) %>%
mutate(price = as.numeric(str_replace_all(p_vect_2,"[^0-9]*","")))
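To see what these patterns actually capture, it can help to run them on a few hand-written strings. The sketch below uses made-up detail strings (the real card text may differ) and compares the bedroom and bathroom patterns from the cleaning step with slightly more tolerant variants. The variants are my own assumption about cases worth handling, namely a singular "bd" and a fractional bathroom count; whether such strings occur in the scraped data is not something the output here confirms.

```r
library(stringr)

# Made-up detail strings in the shape the cleaning patterns expect
details <- c("3 bds2 ba1,850 sqft", "4 bds2.5 ba2,400 sqft", "1 bd1 ba640 sqft")

# Patterns used in the cleaning step
beds_orig  <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)")))
baths_orig <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))

# More tolerant variants (illustrative assumptions, not part of the original code)
beds_alt  <- as.integer(str_extract(details, "\\d+(?=\\s*bds?)"))    # accepts "bd" or "bds"
baths_alt <- as.numeric(str_extract(details, "[\\d.]+(?=\\s*ba)"))   # keeps "2.5" intact
sqft_alt  <- as.numeric(str_replace_all(
  str_extract(details, "[\\d,]+(?=\\s*sqft)"), ",", ""))

data.frame(details, beds_orig, beds_alt, baths_orig, baths_alt, sqft_alt)
```

On these sample strings the original bedroom pattern yields `NA` for "1 bd", and the original bathroom pattern reads "2.5 ba" as 5 because the decimal point is not in its character class. Either effect, if present in the real data, would feed straight into the models.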
At first I tried including all of the data in one plot, which looked cool but was hard to read:
library(ggplot2)
ggplot(data = Boseman_data, aes(sqft, price)) +
geom_point(aes(size = bedrooms, color = bathrooms), alpha = 0.5) +
scale_size(range = c(1, 5)) +
ggtitle("Bozeman houses, sqft vs price, size = bedrooms, color = bathrooms") +
# theme_classic() goes first: as a complete theme it would otherwise reset the title settings
theme_classic() +
theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5))
## Warning: Removed 6 rows containing missing values (geom_point).
So I ended up using the color gradient scale and made one graph for bedrooms and another for bathrooms.
ggplot(data = Boseman_data, aes(sqft, price)) +
geom_point(aes(color = bedrooms), alpha = 0.5) +
ggtitle("Bozeman houses, sqft vs price, color = bedrooms") +
# theme_classic() goes first: as a complete theme it would otherwise reset the title settings
theme_classic() +
theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5))
## Warning: Removed 3 rows containing missing values (geom_point).
ggplot(data = Boseman_data, aes(sqft, price)) +
geom_point(aes(color = bathrooms), alpha = 0.5) +
ggtitle("Bozeman houses, sqft vs price, color = bathrooms") +
# theme_classic() goes first: as a complete theme it would otherwise reset the title settings
theme_classic() +
theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5))
## Warning: Removed 3 rows containing missing values (geom_point).
I actually ended up making two models, one with price and another with log price, because the price values are orders of magnitude larger than the bedroom and bathroom counts and are spread over a very wide range.
model <- lm(price ~ bedrooms + sqft +bathrooms, Boseman_data)
model$coefficients
## (Intercept) bedrooms sqft bathrooms
## 604646.261 -437469.652 684.751 -11198.811
Boseman_data$log_price <- log(Boseman_data$price)
model <- lm(log_price ~ bedrooms + sqft +bathrooms, Boseman_data)
model$coefficients
## (Intercept) bedrooms sqft bathrooms
## 13.0313428033 -0.1513425543 0.0003418344 0.0137051114
As for interpreting the results, the sqft coefficients make sense, at least sign-wise: the more square feet in a house, the more expensive it should be. The bedrooms coefficient, however, is negative in both models, meaning that for a fixed square footage and bathroom count, more bedrooms go with a lower price. The bathrooms coefficient is negative in the price model but slightly positive in the log-price model. This may be due to some outliers, or perhaps to a trend in houses in the area where more open common area is preferred over more bedrooms and bathrooms. Because bedrooms, bathrooms, and sqft tend to rise together, the individual coefficients may also be unstable. More discussion of this unlikely outcome would be interesting.
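The log-price coefficients are easier to read as percentage changes. A small sketch, using the coefficient values printed in the log-price output (typed in by hand here, so the snippet does not depend on the scraped data): a one-unit increase in a predictor multiplies the predicted price by `exp(coefficient)`, holding the other predictors fixed.

```r
# Log-price coefficients as printed in the model output (copied by hand)
log_coefs <- c(bedrooms  = -0.1513425543,
               sqft      =  0.0003418344,
               bathrooms =  0.0137051114)

# Percentage change in predicted price per one-unit increase in each predictor
pct_change <- (exp(log_coefs) - 1) * 100
round(pct_change, 2)
```

Read this way, an extra bedroom at fixed square footage and bathroom count goes with a predicted price roughly 14% lower, while an extra bathroom goes with a price a little over 1% higher. These are associations in this small sample, not causal effects.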