Open the Zillow website in your browser and search for homes in an area of interest. I chose homes with 1+ bedrooms in Athens, GA. It is fine to use another geographic area if you are so inclined, but you may end up with very many or very few observations. This was the URL of my search.
knitr::opts_chunk$set(echo = TRUE, eval=TRUE, message=FALSE, warning=FALSE, fig.height=4)
necessaryPackages <- c("rvest","tidyverse","dplyr","stringr","ggplot2")
new.packages <- necessaryPackages[
!(necessaryPackages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(necessaryPackages, require, character.only = TRUE)
## Loading required package: rvest
## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 3.0.0 ✓ dplyr 1.0.2
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
My search covers houses with at least 1 bedroom and any number of bathrooms in Athens, GA, using Zillow. The results are spread over 5 pages, which I collate below.
page1 <- read_html("https://www.zillow.com/athens-ga/1-_beds/?searchQueryState={%22pagination%22:{},%22mapBounds%22:{%22west%22:-85.89432128906252,%22east%22:-80.78567871093752,%22south%22:32.08423540326219,%22north%22:35.748103174169024},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:false,%22mapZoom%22:8,%22filterState%22:{%22beds%22:{%22min%22:1}},%22isListVisible%22:true}")
page2 <- read_html("https://www.zillow.com/athens-ga/houses/1-_beds/2_p/?searchQueryState={%22pagination%22:{%22currentPage%22:2},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.72057966113282,%22east%22:-83.08199933886719,%22south%22:33.742964913358136,%22north%22:34.172963028242684},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false}},%22isListVisible%22:true}")
page3 <- read_html("https://www.zillow.com/athens-ga/houses/1-_beds/3_p/?searchQueryState={%22pagination%22:{%22currentPage%22:3},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.72057966113282,%22east%22:-83.08199933886719,%22south%22:33.728974965801584,%22north%22:34.18688016441001},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false}},%22isListVisible%22:true}")
page4 <- read_html("https://www.zillow.com/athens-ga/houses/1-_beds/4_p/?searchQueryState={%22pagination%22:{%22currentPage%22:4},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.72057966113282,%22east%22:-83.08199933886719,%22south%22:33.728974965801584,%22north%22:34.18688016441001},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false}},%22isListVisible%22:true}")
page5 <- read_html("https://www.zillow.com/athens-ga/houses/1-_beds/5_p/?searchQueryState={%22pagination%22:{%22currentPage%22:5},%22usersSearchTerm%22:%22Athens,%20GA%22,%22mapBounds%22:{%22west%22:-83.72057966113282,%22east%22:-83.08199933886719,%22south%22:33.728974965801584,%22north%22:34.18688016441001},%22regionSelection%22:[{%22regionId%22:23534,%22regionType%22:6}],%22isMapVisible%22:true,%22mapZoom%22:11,%22filterState%22:{%22beds%22:{%22min%22:1},%22sortSelection%22:{%22value%22:%22globalrelevanceex%22},%22isManufactured%22:{%22value%22:false},%22isLotLand%22:{%22value%22:false},%22isTownhouse%22:{%22value%22:false}},%22isListVisible%22:true}")
z_pages <- list(page1, page2, page3, page4, page5)
str(z_pages) # a list of 5 parsed pages, each of class xml_document
## List of 5
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ :List of 2
## ..$ node:<externalptr>
## ..$ doc :<externalptr>
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
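The five `read_html()` calls above differ mainly in the `N_p/` path segment. As a minimal sketch, the page URLs could be generated rather than pasted. The helper `build_page_urls` is hypothetical, and dropping the `searchQueryState` query string is an assumption: whether Zillow returns the same listings without it is not confirmed here.

```r
# Hypothetical helper: build one URL per results page from a base search URL.
# Assumption: page 1 is the base URL and page p (p > 1) appends "p_p/",
# mirroring the path pattern visible in the URLs above.
build_page_urls <- function(base_url, n_pages) {
  vapply(seq_len(n_pages), function(p) {
    if (p == 1) base_url else paste0(base_url, p, "_p/")
  }, character(1))
}

urls <- build_page_urls("https://www.zillow.com/athens-ga/houses/1-_beds/", 5)
print(urls)

# Reading the pages is a network call, so it is left commented out:
# z_pages <- lapply(urls, rvest::read_html)
```

Generating the URLs this way avoids copy-paste slips in the page number.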
For me the results were spread over 5 pages, so I scraped 5 times in a loop. Extract the following elements:
- Price of the house
- Details of the house (bathrooms, bedrooms, square footage, anything else you want)

The functions below will be used to scrape the housing search results.
scrapPrice <- function(html){
html %>% html_nodes('.list-card-price') %>% html_text() %>% str_trim()
}
scrapDet <- function(html){
html %>% html_nodes('.list-card-details') %>% html_text() %>% str_trim()
}
scrapAddr <- function(html){
html %>% html_nodes('.list-card-addr') %>% html_text() %>% str_trim()
}
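rawDataset <- NULL
# Scrape each page in turn and append its listings to rawDataset
for (i in seq_along(z_pages)){
  Price <- scrapPrice(z_pages[[i]])
  Det <- scrapDet(z_pages[[i]])
  Addr <- scrapAddr(z_pages[[i]])
  # cbind() assumes the three selectors return the same number of listings
  rawDataset <- rbind(rawDataset, data.frame(cbind(Price, Det, Addr)))
}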
str(rawDataset)
## 'data.frame': 120 obs. of 3 variables:
## $ Price: Factor w/ 90 levels "$1,325,000","$1,395,000",..: 2 10 21 28 33 10 36 5 1 32 ...
## $ Det : Factor w/ 101 levels "18 bds18 ba8,352 sqft- Condo for sale",..: 1 27 22 6 35 14 33 25 36 37 ...
## $ Addr : Factor w/ 104 levels "100 Duncan Springs Rd, Athens, GA 30606",..: 2 19 15 25 11 10 36 14 6 29 ...
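The scraping loop relies on the three CSS selectors returning the same number of nodes on every page. A minimal sketch of a per-page scraper that checks this is shown here on a tiny inline HTML snippet. The snippet, the addresses in it, and the helpers `scrape_field` and `scrape_page` are all hypothetical; only the three class names come from the functions above.

```r
library(rvest)
library(stringr)

# Hypothetical listing markup, reduced to the three classes used above.
sample_page <- read_html(paste0(
  "<ul>",
  "<li><div class='list-card-price'> $189,900 </div>",
  "<div class='list-card-details'>4 bds2 ba850 sqft- House for sale</div>",
  "<address class='list-card-addr'>100 Example St, Athens, GA 30606</address></li>",
  "<li><div class='list-card-price'> $375,000 </div>",
  "<div class='list-card-details'>3 bds3 ba1,599 sqft- New construction</div>",
  "<address class='list-card-addr'>200 Example Ave, Athens, GA 30607</address></li>",
  "</ul>"
))

# Extract the trimmed text of every node matching a CSS selector.
scrape_field <- function(html, css) {
  html %>% html_nodes(css) %>% html_text() %>% str_trim()
}

# Scrape one page into a data frame, stopping if the fields are misaligned.
scrape_page <- function(html) {
  fields <- list(
    Price = scrape_field(html, ".list-card-price"),
    Det   = scrape_field(html, ".list-card-details"),
    Addr  = scrape_field(html, ".list-card-addr")
  )
  stopifnot(length(unique(lengths(fields))) == 1)
  data.frame(fields, stringsAsFactors = FALSE)
}

# Combine several pages in one step instead of growing a data frame in a loop.
demo <- do.call(rbind, lapply(list(sample_page, sample_page), scrape_page))
print(demo)
```

Keeping the columns as character (`stringsAsFactors = FALSE`) also avoids the factor columns seen in the `str()` output above, which is one possible design choice rather than a requirement.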
## Do some cleaning
You will need to split the string that gives bathrooms, bedrooms and square footage. This might involve some use of regular expressions. To avoid a student uprising, here is a hint: mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)"))))
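To see what the lookahead pattern in the hint does, here is a small sketch applied to a single details string taken from the `str()` output above. The pattern `[\\d ]*(?=bds)` grabs the run of digits and spaces that sits immediately before `bds`; the same idea works for `ba` and `sqft`. The variable names are illustrative only.

```r
library(stringr)

# One details string as it appears in the scraped data
details <- "18 bds18 ba8,352 sqft- Condo for sale"

# Digits (and spaces) immediately followed by "bds"
bedrooms <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)")))

# Digits (and spaces) immediately followed by "ba"
bathrooms <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))

# Digits, spaces and commas immediately followed by "sqft"; drop the commas
sqft_text <- str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))
sqft <- as.numeric(str_replace_all(sqft_text, ",", ""))

print(c(bedrooms = bedrooms, bathrooms = bathrooms, sqft = sqft))
```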
I obtain the cleaned dataset using the code below:
Dataset <- rawDataset %>%
mutate(bedrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=bds)")))) %>%
mutate(bathrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=ba)")))) %>%
mutate(sqft = str_trim(str_extract(Det, "[\\d ,]*(?=sqft)"))) %>%
mutate(sqft = as.numeric(str_replace(sqft,",",""))) %>%
mutate(price = as.numeric(str_replace_all(Price,"[^0-9]*","")))
Dataset <- na.omit(Dataset)
str(Dataset)
## 'data.frame': 105 obs. of 7 variables:
## $ Price : Factor w/ 90 levels "$1,325,000","$1,395,000",..: 2 10 21 28 33 10 36 5 32 34 ...
## $ Det : Factor w/ 101 levels "18 bds18 ba8,352 sqft- Condo for sale",..: 1 27 22 6 35 14 33 25 37 15 ...
## $ Addr : Factor w/ 104 levels "100 Duncan Springs Rd, Athens, GA 30606",..: 2 19 15 25 11 10 36 14 29 17 ...
## $ bedrooms : int 18 4 3 2 4 3 4 4 5 3 ...
## $ bathrooms: int 18 2 3 2 5 2 4 1 4 2 ...
## $ sqft : num 8352 850 1599 1579 5224 ...
## $ price : num 1395000 189900 375000 464900 679900 ...
## - attr(*, "na.action")= 'omit' Named int 9 16 17 35 41 47 66 73 85 90 ...
## ..- attr(*, "names")= chr "9" "16" "17" "35" ...
summary(Dataset)
## Price Det
## $189,900: 6 4 bds3 ba1,835 sqft- House for sale : 4
## $375,000: 3 2 bds2 ba1,579 sqft- House for sale : 2
## $475,000: 3 2 bds3 ba1,574 sqft- House for sale : 2
## $659,900: 3 3 bds2 ba1,200 sqft- House for sale : 2
## $299,000: 2 3 bds2 ba1,922 sqft- House for sale : 2
## $309,900: 2 3 bds3 ba1,599 sqft- New construction: 2
## (Other) :86 (Other) :91
## Addr bedrooms
## 11206 Jefferson Rd, Athens, GA 30607 : 2 Min. : 2.000
## 140 Rosewood Pl, Athens, GA 30606 : 2 1st Qu.: 3.000
## 150 Wedgefield Ln, Athens, GA 30607 : 2 Median : 4.000
## 160 Ruthwood Ln, Athens, GA 30606 : 2 Mean : 3.733
## 180 Ridgewood Pl, Athens, GA 30606 : 2 3rd Qu.: 4.000
## 210 Jefferson River Rd, Athens, GA 30607: 2 Max. :18.000
## (Other) :93
## bathrooms sqft price
## Min. : 1.000 Min. : 780 Min. : 72900
## 1st Qu.: 2.000 1st Qu.: 1429 1st Qu.: 254900
## Median : 3.000 Median : 1899 Median : 439900
## Mean : 3.286 Mean : 2534 Mean : 499149
## 3rd Qu.: 4.000 3rd Qu.: 3262 3rd Qu.: 669900
## Max. :18.000 Max. :11031 Max. :2399000
##
Final_dataframe <- as.data.frame(Dataset)
Make any plot(s) you think are informative.
To explore a linear relationship between price and square footage, I will plot price against square footage. I am also interested in homes that have more bedrooms than bathrooms, for a few reasons: an extra room can serve as (i) an office or (ii) a storage room, and (iii) bedrooms can share a bathroom. I therefore start with histograms of bedrooms, bathrooms, and price.
plot.beds <- ggplot(data=Final_dataframe, aes(bedrooms)) +
geom_histogram(binwidth=1) + # one bar per bedroom count
labs(title="Histogram for Bedrooms", x="Number of bedrooms", y="Count")
plot.beds
plot.baths <- ggplot(data=Final_dataframe, aes(bathrooms)) +
geom_histogram(binwidth=1) + # one bar per bathroom count
labs(title="Histogram for Bathrooms", x="Number of bathrooms", y="Count")
plot.baths
plot.price <- ggplot(data=Final_dataframe, aes(price)) +
geom_histogram(bins=30) + # 30 bins is the ggplot2 default, stated explicitly
labs(title="Histogram for Price", x="Price of homes ($)", y="Count")
plot.price
graph1 <- ggplot(data = Dataset, aes(x = sqft, y = price, color = bedrooms)) +
geom_point() +
labs(y = "Price ($)",
x = "Square footage (square feet)",
title = "Price vs. square footage of houses in Athens, GA")
graph1
We see a positive relationship between price and square footage. Broadly, as the number of bedrooms goes up, price and square footage also go up. Most listings cluster at the lower bedroom counts, and there is an outlier with a very high number of bedrooms but a small square footage per bedroom.
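The square footage per bedroom mentioned above is not a column in the dataset, but it can be derived. A minimal sketch follows, using the first five cleaned rows shown in the `str(Dataset)` output. The data frame `first_rows` and the new column names are illustrative, not part of the original analysis.

```r
library(dplyr)

# First five cleaned rows, copied from the str(Dataset) output above
first_rows <- data.frame(
  bedrooms = c(18L, 4L, 3L, 2L, 4L),
  sqft     = c(8352, 850, 1599, 1579, 5224),
  price    = c(1395000, 189900, 375000, 464900, 679900)
)

# Derived ratios: floor area per bedroom and asking price per square foot
first_rows <- first_rows %>%
  mutate(sqft_per_bedroom = sqft / bedrooms,
         price_per_sqft   = price / sqft)

print(first_rows)
```

In this sketch the 18-bedroom listing works out to 464 square feet per bedroom, which is the kind of outlier the scatter plot points to.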
Run a simple OLS command, as follows (with your variable names, of course).
I estimated two models: a linear model and a log-log model. The regression results show a higher R-squared for the linear model (0.818) than for the log-log model (0.742), although the two values are not directly comparable because the dependent variables differ (price versus log(price)).
#model 1
#linear model
linear_model <- lm(price~bedrooms+sqft, data=Final_dataframe)
summary(linear_model)
##
## Call:
## lm(formula = price ~ bedrooms + sqft, data = Final_dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -372710 -71528 23765 70501 367317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75159.74 31868.41 2.358 0.0203 *
## bedrooms -18649.12 10406.62 -1.792 0.0761 .
## sqft 194.79 11.48 16.968 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140000 on 102 degrees of freedom
## Multiple R-squared: 0.8184, Adjusted R-squared: 0.8149
## F-statistic: 229.9 on 2 and 102 DF, p-value: < 2.2e-16
#model 2
#log-log model (still linear in its parameters)
nonlinear_model <- lm(log(price)~log(bedrooms)+log(sqft), data=Final_dataframe)
summary(nonlinear_model)
##
## Call:
## lm(formula = log(price) ~ log(bedrooms) + log(sqft), data = Final_dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26260 -0.19347 0.08338 0.20092 0.54001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.14201 0.53111 9.682 4.13e-16 ***
## log(bedrooms) -0.15733 0.13178 -1.194 0.235
## log(sqft) 1.04069 0.08322 12.505 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3217 on 102 degrees of freedom
## Multiple R-squared: 0.7422, Adjusted R-squared: 0.7372
## F-statistic: 146.9 on 2 and 102 DF, p-value: < 2.2e-16
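To make the two sets of coefficients concrete, here is a small worked example that plugs them into each fitted equation for one illustrative house. The 3-bedroom, 2,000-square-foot house is a hypothetical input chosen for the arithmetic; the coefficients are copied from the two summaries above. Exponentiating the log-log prediction is a simple back-transformation and ignores any retransformation bias.

```r
# Hypothetical house used only for illustration
beds <- 3
sqft <- 2000

# Linear model: price = b0 + b1 * bedrooms + b2 * sqft
pred_linear <- 75159.74 - 18649.12 * beds + 194.79 * sqft

# Log-log model: log(price) = a0 + a1 * log(bedrooms) + a2 * log(sqft)
pred_log <- 5.14201 - 0.15733 * log(beds) + 1.04069 * log(sqft)
pred_loglog <- exp(pred_log)  # naive back-transformation to dollars

print(c(linear = pred_linear, loglog = pred_loglog))
```

In the log-log model the coefficient on `log(sqft)` reads as an elasticity: a 1% increase in square footage is associated with roughly a 1.04% increase in price, holding bedrooms fixed.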
Given the analysis above, I have one suggestion:
* I would advise choosing houses priced around $500,000 with 2-3 bedrooms, because adding bedrooms without adding square footage may actually lower the price: the bedrooms coefficient is negative in both models once square footage is controlled for, although it is not statistically significant at the 5% level in either. This depends entirely on the area in which one is choosing to buy.