For this assignment, I used the Zillow website to analyze housing with two or more bedrooms and one or more bathrooms in Athens, GA. The results were spread across three pages, so I scraped each page in a loop.
# Load the packages used throughout
library(rvest)     # read_html(), html_nodes(), html_text()
library(tidyverse) # %>%, tibble(), mutate(), str_*(), ggplot()
# Make helpful functions
# 1. Price of the house
get_price <- function(html){
  html %>%
    # CSS class that holds the price
    html_nodes('.list-card-price') %>%
    html_text() %>%
    str_trim()
}
# 2. Details of the house
get_details <- function(html){
  html %>%
    # CSS class that holds beds, baths, and square footage
    html_nodes('.list-card-details') %>%
    html_text() %>%
    str_trim()
}
# 3. Address of the house
get_addr <- function(html){
  html %>%
    # CSS class that holds the address
    html_nodes('.list-card-addr') %>%
    html_text() %>%
    str_trim()
}
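The three helpers differ only in the CSS selector, so they could be folded into a single function. A minimal sketch, assuming a hypothetical helper name `get_field` and a made-up HTML snippet standing in for a results page (it is not real Zillow markup):

```r
library(rvest)
library(stringr)

# Hypothetical generic helper: trimmed text for any CSS selector
get_field <- function(html, css) {
  html %>%
    html_nodes(css) %>%
    html_text() %>%
    str_trim()
}

# Made-up snippet for illustration only
demo_html <- read_html('<ul>
  <li><div class="list-card-price"> $100,000 </div>
      <div class="list-card-addr">1 Example St</div></li>
  <li><div class="list-card-price"> $200,000 </div>
      <div class="list-card-addr">2 Example Ave</div></li>
</ul>')

demo_price <- get_field(demo_html, ".list-card-price")
demo_addr  <- get_field(demo_html, ".list-card-addr")
```

With this helper, `get_price(html)` would be equivalent to `get_field(html, ".list-card-price")`.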
# Loop for 3 pages
price <- c()
details <- c()
addr <- c()
# Pieces of the search URL shared by all pages
base_url <- "https://www.zillow.com/athens-ga/2-_beds/1.0-_baths/"
query_state <- "?searchQueryState=%7B%22usersSearchTerm%22%3A%22Athens%2C%20GA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-83.80297762207032%2C%22east%22%3A-82.99960237792969%2C%22south%22%3A33.72954602872356%2C%22north%22%3A34.18631216297458%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A23534%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22beds%22%3A%7B%22min%22%3A2%7D%2C%22baths%22%3A%7B%22min%22%3A2%7D%2C%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11"
for (i in 1:3) {
  if (i == 1) {
    # First page: no page suffix in the path and no pagination entry
    zillow_url <- paste0(base_url, query_state, "%7D")
  } else {
    # Later pages: the page number goes in both the path and currentPage
    zillow_url <- paste0(base_url, i, "_p/", query_state,
                         "%2C%22pagination%22%3A%7B%22currentPage%22%3A", i, "%7D%7D")
  }
  zillow_html <- read_html(zillow_url)
  price <- c(price, get_price(zillow_html))
  details <- c(details, get_details(zillow_html))
  addr <- c(addr, get_addr(zillow_html))
}
# Check if lengths agree
length(price); length(details); length(addr)
## [1] 27
## [1] 27
## [1] 27
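Equal lengths do not by themselves guarantee that the i-th price, details string, and address belong to the same listing: if one card lacked a field, the vectors would shift against each other. A sketch of a per-card extraction that keeps rows aligned, assuming each listing sits in a container with a hypothetical class `.list-card` (the snippet is made up, not real Zillow markup):

```r
library(rvest)
library(tibble)

# Made-up snippet: the second card has no price
demo_html <- read_html('<div>
  <div class="list-card">
    <div class="list-card-price">$100,000</div>
    <div class="list-card-addr">1 Example St</div>
  </div>
  <div class="list-card">
    <div class="list-card-addr">2 Example Ave</div>
  </div>
</div>')

cards <- html_nodes(demo_html, ".list-card")

# html_node() (singular) returns exactly one result per card,
# so a missing field becomes NA instead of shortening the vector
demo_tab <- tibble(
  price = cards %>% html_node(".list-card-price") %>% html_text(),
  addr  = cards %>% html_node(".list-card-addr") %>% html_text()
)
```

Here `demo_tab` has two rows, and the missing price shows up as `NA` in the second row rather than silently misaligning the columns.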
# Look at the table
tab_house <- tibble(price, details, addr)
head(tab_house)
## # A tibble: 6 x 3
##   price      details                             addr
##   <chr>      <chr>                               <chr>
## 1 $324,900   3 bds2 ba2,245 sqft- House for sale 717 Weeping Willow Dr, Athens,~
## 2 $225,000   3 bds2 ba1,400 sqft- House for sale 195 Chatham Dr, Athens, GA 306~
## 3 $415,000   3 bds2 ba2,766 sqft- House for sale 56 Reese Ridge Dr, Athens, GA ~
## 4 $240,000   4 bds2 ba1,328 sqft- House for sale 398 Arch St, Athens, GA 30601
## 5 $1,149,000 4 bds4 ba-- sqft- House for sale    240 Morton Ave, Athens, GA 306~
## 6 $535,000   4 bds3 ba2,301 sqft- House for sale 150 Jane Cir, Athens, GA 30606
library(stringr)
# Create an empty data frame with one row per scraped listing
zillow_df <- data.frame(matrix(nrow = length(price), ncol = 0))
# Clean the scraped strings; `details` and `price` are the vectors built in the loop
zillow_df2 <- zillow_df %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)")))) %>%
  mutate(bathrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))) %>%
  mutate(sqft = str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))) %>%
  # Drop thousands separators before converting to numbers
  mutate(sqft = as.numeric(str_replace_all(sqft, ",", ""))) %>%
  mutate(price = as.numeric(str_replace_all(price, "[^0-9]", "")))
# Here is the dataframe
head(zillow_df2)
##   bedrooms bathrooms sqft   price
## 1        3         2 2245  324900
## 2        3         2 1400  225000
## 3        3         2 2766  415000
## 4        4         2 1328  240000
## 5        4         4   NA 1149000
## 6        4         3 2301  535000
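The extraction patterns rely on the plural label `bds` and on whole-number bathroom counts. A slightly more defensive sketch, assuming listings might also use the singular `bd` or a decimal count such as `1.5 ba` (the third example string is invented for illustration):

```r
library(stringr)

demo_details <- c(
  "3 bds2 ba2,245 sqft- House for sale",  # format seen in the scraped data
  "4 bds4 ba-- sqft- House for sale",     # missing square footage
  "1 bd1.5 ba980 sqft- Condo for sale"    # invented edge case
)

# Digits immediately before " bd" or " bds"
bedrooms <- as.integer(str_extract(demo_details, "\\d+(?= bds?)"))
# Digits (optionally with a decimal point) immediately before " ba"
bathrooms <- as.numeric(str_extract(demo_details, "[\\d.]+(?= ba)"))
# Digits with thousands separators before " sqft"; "--" yields NA
sqft <- as.numeric(
  str_remove_all(str_extract(demo_details, "[\\d,]+(?= sqft)"), ",")
)
```

Storing `bathrooms` as numeric rather than integer is a design choice here so that half baths would not be turned into `NA`.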
# Plot
p <- ggplot(zillow_df2, aes(x = sqft, y = price, size = sqft, color = as.factor(bedrooms))) +
geom_point()
p
The plot shows a roughly linear, positive relationship between the size and the price of a house, although there are some outliers. In general, the price of a house increases with its size.
modOLS <- lm(price ~ bedrooms + bathrooms + sqft, zillow_df2)
summary(modOLS)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft, data = zillow_df2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1141139  -201958   -41180   178297  1091578
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  765693.0   631557.6   1.212 0.239492
## bedrooms    -472553.2   196094.3  -2.410 0.025711 *
## bathrooms     42895.6   181868.0   0.236 0.815939
## sqft            535.9      137.4   3.901 0.000888 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567900 on 20 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.6303, Adjusted R-squared: 0.5749
## F-statistic: 11.37 on 3 and 20 DF, p-value: 0.0001437
The regression results show that the price of a house is positively and significantly related to its size (sqft) and negatively and significantly related to the number of bedrooms. The coefficient on sqft is 535.9 and is significant at the 1% level: holding the number of bedrooms and bathrooms constant, a house with one more square foot costs about 535.9 dollars more on average. Similarly, the coefficient on bedrooms is -472553.2 and is significant at the 5% level, so, holding size and the number of bathrooms constant, an additional bedroom is associated with a price that is lower by roughly 472,553 dollars on average. The number of bathrooms does not have a significant effect on the housing price.
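To make the coefficient interpretation concrete, the fitted equation can be evaluated by hand. The sketch below plugs the reported estimates into the regression equation for two hypothetical houses; the helper name `predict_price` and the house characteristics are invented for illustration:

```r
# Coefficient estimates copied from the summary output
coefs <- c(intercept = 765693.0, bedrooms = -472553.2,
           bathrooms = 42895.6, sqft = 535.9)

# Hypothetical helper: fitted price for given characteristics
predict_price <- function(bedrooms, bathrooms, sqft) {
  unname(coefs["intercept"] + coefs["bedrooms"] * bedrooms +
           coefs["bathrooms"] * bathrooms + coefs["sqft"] * sqft)
}

# Two invented houses that differ only in the number of bedrooms
house_a <- predict_price(bedrooms = 3, bathrooms = 2, sqft = 2000)
house_b <- predict_price(bedrooms = 4, bathrooms = 2, sqft = 2000)

# The gap between them is exactly the bedrooms coefficient
bedroom_effect <- house_b - house_a
```

The comparison holds square footage fixed, so the extra bedroom has to come out of the same floor area. That is one plausible reading of the negative sign, though the large standard error suggests the estimate is imprecise.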
The R-squared is 0.6303, which implies that 63.03% of the variation in housing prices is explained by the model.
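The adjusted R-squared in the summary can be recovered from the multiple R-squared and the sample size. A small check, assuming the standard adjustment formula and using only numbers from the output (27 listings, 3 dropped for missing values, 3 predictors):

```r
r2 <- 0.6303   # multiple R-squared from the summary
n  <- 27 - 3   # listings used in the fit after dropping missing values
p  <- 3        # predictors: bedrooms, bathrooms, sqft

# Standard adjustment: penalize R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```

The result is about 0.575, in line with the reported 0.5749 up to rounding of the inputs, and `n - p - 1` reproduces the 20 residual degrees of freedom.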
We could include variables describing the location of the property, neighborhood characteristics, the year of construction, amenities, environmental characteristics, and many more.
Two examples are the distance from the nearest open space and lumber prices, which reflect the demand for and supply of wood.
The distance from the nearest open space can be calculated using GIS or R if we know the location of the house, and lumber prices can be obtained from private timber price compilers such as TimberMart-South.
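If coordinates for a house and an open space were available, the distance could be computed directly in R. A minimal base-R sketch using the haversine formula; the function name `haversine_km` and the coordinates are made up for illustration, and a dedicated spatial package would be the more usual choice for real GIS data:

```r
# Great-circle distance in km between two (lat, lon) points given in degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Invented coordinates near Athens, GA (not real listings or parks)
house <- c(lat = 33.95, lon = -83.38)
park  <- c(lat = 33.96, lon = -83.40)

dist_km <- unname(
  haversine_km(house["lat"], house["lon"], park["lat"], park["lon"])
)
```

The resulting `dist_km` could then be added as a column to the regression data, one value per listing.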