I am running a housing search. For example, take a quick look at this Zillow website. In an ideal world, Zillow would not block scrapers, and you could do the whole exercise below using that link. However, Zillow is pretty good at detecting simple scrapers and blocking them, so we’re going to practice on a static copy of the page instead of the real Zillow.
Click on this copied Zillow page and give it a second to load. The page contains a copy of real Zillow home search results from 2 years ago. The copy loses a lot of functionality, such as the interactive map, but the essentials needed for the exercise are there: a list of homes with some pictures, prices, and characteristics. This is what we will scrape for this exercise.
Extract the following elements from the copied page:
- Price of the house
- Details of the house (bathrooms, bedrooms, square footage, anything else you want)
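Before hitting the real page, it can help to try the class selectors on a tiny inline document. The snippet below is a minimal sketch: the HTML is hypothetical and only mirrors the class names used in this exercise (`list-card-price`, `list-card-details`); the real page markup is more complex.

```r
library(rvest)

# Hypothetical listing markup; only the class names are taken from this exercise.
snippet <- '
<ul>
  <li><div class="list-card-price">$450,000</div>
      <div class="list-card-details">3 bds2 ba1,500 sqft- House for sale</div></li>
  <li><div class="list-card-price">$210,000</div>
      <div class="list-card-details">2 bds1 ba900 sqft- Condo for sale</div></li>
</ul>'

# read_html() accepts literal HTML as well as a URL.
page <- read_html(snippet)

# A leading "." in a CSS selector means "elements with this class".
prices  <- page %>% html_nodes(".list-card-price") %>% html_text(trim = TRUE)
details <- page %>% html_nodes(".list-card-details") %>% html_text(trim = TRUE)

print(prices)
print(details)
```

Each selector returns one string per listing card, so the two vectors line up row by row.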
library(rvest)
library(tidyverse)
zillow_url <- 'https://www.mfilipski.com/random/zillow'
zillow_html <- read_html(zillow_url)
head(zillow_html)
## $node
## <pointer: 0x107a1c6e0>
##
## $doc
## <pointer: 0x107a1bed0>
# helper functions to pull prices and listing details (beds, baths, etc.)
# get_price() reads the elements with the CSS class 'list-card-price'
get_price <- function(html){
  html %>%
    html_nodes('.list-card-price') %>%
    html_text() %>%
    str_trim()
}
# get_details() reads the elements with the CSS class 'list-card-details'
get_details <- function(html){
  html %>%
    html_nodes('.list-card-details') %>%
    html_text() %>%
    str_trim()
}
price <- get_price(zillow_html)
price <- data.frame(price)
detail <- get_details(zillow_html)
detail <- data.frame(detail)
# There are 11 observations altogether.
The details string has three components (bedrooms, bathrooms, and square footage) that need to be split into separate data points. This involves some use of regular expressions. For example, the bedrooms regex used below extracts whatever digits are located in front of the string ‘bd’.
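To see what these lookahead patterns do, here is a small worked example. The details strings are hypothetical, shaped like the scraped text once commas have been removed; the three regexes are the ones used in this exercise.

```r
library(stringr)

# Hypothetical details strings (not the scraped data).
detail <- c("3 bds2 ba1500 sqft- House for sale",
            "2 bds1 ba900 sqft- Condo for sale")

# "[\\d ]*" matches a run of digits and spaces; "(?=bd)" is a lookahead that
# requires "bd" to follow without making it part of the match.
bedrooms  <- as.integer(str_trim(str_extract(detail, "[\\d ]*(?=bd)")))
bathrooms <- as.integer(str_trim(str_extract(detail, "[\\d ]*(?=ba)")))
sqft      <- as.integer(str_trim(str_extract(detail, "[\\d ]*(?=sqft)")))

print(data.frame(bedrooms, bathrooms, sqft))
```

Note that `[\\d ]*` only covers digits and spaces, so a value such as "2.5 ba" would not be captured whole; that case would need a pattern that also allows a decimal point.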
library(dplyr)
library(gsubfn)
df <- cbind(price, detail)
# remove "$" and "," from every column (this also turns e.g. "1,500 sqft" into "1500 sqft")
df[] <- lapply(df, gsub, pattern = "$", fixed = TRUE, replacement = "")
df[] <- lapply(df, gsub, pattern = ",", fixed = TRUE, replacement = "")
# prices are still character strings at this point; convert them to numbers
# (any entry that is not a plain number becomes NA)
df$price <- as.numeric(df$price)
# use lookahead regexes on "bd", "ba" and "sqft" to split the details string
df <- df %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=bd)"))),
         bathrooms = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=ba)"))),
         sqft = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=sqft)"))))
#also create the new variable: house or condo
df$house_condo <- str_extract(df$detail, pattern = "House")
df$house_condo[is.na(df$house_condo)] <- "Condo"
#create dummy: house=1 and 0 otherwise
df$house <- ifelse(df$house_condo == 'House', 1, 0)
# we have the data we need for analysis
library(ggplot2)
figure <- ggplot(df, aes(x = sqft)) +
  theme_classic() +
  geom_point(aes(x = sqft, y = price, colour = 'sqft'), size = 2) +
  scale_color_manual(name = "",
                     values = c("sqft" = "red")) + # manual colour scale
  xlab("Square Feet") + ylab("Price in Dollars") +
  ggtitle("Figure: Price vs. Square Feet")
figure
As square footage increases, the price of the house tends to increase, as expected. The relationship looks roughly linear. Let’s see if we can estimate the marginal effect of each additional square foot.
A simple OLS command runs the following “hedonic pricing” model: price ~ number of bedrooms + number of bathrooms + square footage (plus any other variable we have scraped).
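Written out as an equation, and assuming the house/condo dummy built earlier is included as the extra scraped variable, the model is roughly the following (the coefficient symbols are notation introduced here, not part of the original exercise):

```latex
\text{price}_i = \beta_0 + \beta_1\,\text{bedrooms}_i + \beta_2\,\text{bathrooms}_i
               + \beta_3\,\text{sqft}_i + \beta_4\,\text{house}_i + \varepsilon_i
```

Each slope is read as the change in price associated with a one-unit change in that characteristic, holding the others fixed.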
# regression
model <- lm(price ~ bedrooms + bathrooms + sqft + house, data = df)
#tabulate the regression results:
library(stargazer)
stargazer(model,
          type = "text",
          align = TRUE,
          no.space = TRUE,
          covariate.labels = c("Number of Bedrooms", "Number of Bathrooms", "Sq. Foot", "House=1"),
          title = "Hedonic Analysis")
##
## Hedonic Analysis
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                price
## -----------------------------------------------
## Number of Bedrooms         99,979.210**
##                            (24,150.180)
## Number of Bathrooms      -228,336.400***
##                            (39,405.040)
## Sq. Foot                    255.082***
##                              (18.491)
## House=1                    55,339.140*
##                            (24,150.280)
## Constant                    -4,007.755
##                            (30,381.080)
## -----------------------------------------------
## Observations                    9
## R2                            0.993
## Adjusted R2                   0.986
## Residual Std. Error    28,155.600 (df = 4)
## F Statistic          138.470*** (df = 4; 4)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
Comment: All coefficients except the one on bathrooms have the expected sign. Price increases with the number of bedrooms and with square footage; that is, the larger the home, the higher the price, which makes sense. More specifically, one additional bedroom is associated with a price increase of about $99,979 (significant at the 5 percent level), and each additional square foot with an increase of about $255. Houses are also more expensive than condominiums, by about $55,339 (significant at the 10 percent level). There are only 9 observations because this is just a simple web scraping exercise.
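As a quick check on how thin the data are, the residual degrees of freedom reported in the table follow from the n = 9 observations and the k = 4 regressors plus the constant:

```latex
\text{residual df} = n - (k + 1) = 9 - (4 + 1) = 4
```

With so few degrees of freedom, the high R2 and the significance stars are best read as illustrative.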
Scraping data from any existing publicly-available website, and running a simple analysis.
I am wondering whether prices are higher when a good is rated higher. For this I need information on consumer ratings and on the market price of the good. Let’s use Amazon’s listing for the Study Chair to scrape the data and analyze it.
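Both pieces of information arrive as text and have to be turned into numbers. The sketch below shows one way to do that, assuming rating strings of the form "4.5 out of 5 stars" and whole-dollar price strings with a trailing dot; the sample values are hypothetical, and this is a simpler alternative to the digit-by-digit split used later in the exercise.

```r
library(stringr)

# Hypothetical scraped strings; the real page may differ.
stars <- c("4.5 out of 5 stars", "4.0 out of 5 stars", "3.9 out of 5 stars")
price <- c("1,059.", "59.", "120.")

# Take the leading number, with an optional decimal part, as the rating.
star_value <- as.numeric(str_extract(stars, "^\\d+(\\.\\d+)?"))

# Keep only the digits of the whole-dollar price text.
price_value <- as.numeric(str_replace_all(price, "[^\\d]", ""))

print(data.frame(star_value, price_value))
```

Storing both columns as numeric keeps later plotting and regression steps from treating them as categories.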
# Scraping data from Amazon
library(rvest)
library(tidyverse)
amazon_url <- 'https://www.amazon.com/s?k=chair&crid=DG992UODKKDY&sprefix=chair%2Caps%2C110&ref=nb_sb_noss_1'
amazon_html <- read_html(amazon_url)
head(amazon_html)
## $node
## <pointer: 0x10e40dfc0>
##
## $doc
## <pointer: 0x10e40dad0>
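# helper functions to pull prices, the Prime icon, and star ratings
# get_price_amazon() reads the elements with the CSS class 'a-price-whole'
get_price_amazon <- function(html){
  html %>%
    html_nodes('.a-price-whole') %>%
    html_text() %>%
    str_trim()
}
# get_primeicon() reads elements that carry all three classes;
# in a CSS selector, multiple classes on one element are chained with dots, not spaces
get_primeicon <- function(html){
  html %>%
    html_nodes('.a-icon.a-icon-prime.a-icon-medium') %>%
    html_text() %>%
    str_trim()
}
# get_stars() reads the rating text from elements with the CSS class 'a-icon-alt'
get_stars <- function(html){
  html %>%
    html_nodes('.a-icon-alt') %>%
    html_text() %>%
    str_trim()
}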
price <- get_price_amazon(amazon_html)
price <- data.frame(price)
stars <- get_stars(amazon_html)
stars <- data.frame(stars)
stars <- head(stars, -6)
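# combine into a dataset
df2 <- cbind(price, stars)
library(dplyr)
library(gsubfn)
# remove "." from prices only (the star ratings need their decimal point)
df2$price <- gsub(".", "", df2$price, fixed = TRUE)
# remove "," from every column
df2[] <- lapply(df2, gsub, pattern = ",", fixed = TRUE, replacement = "")
# prices are still character strings at this point; convert them to numbers
df2$price <- as.numeric(df2$price)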
# split each rating string (e.g. "4.5 out of 5 stars") around 'out of' to get the required data
df2 <- df2 %>%
  mutate(back = as.integer(str_trim(str_extract(stars, "[\\d ]*(?=out)")))) # digit just before "out" (the decimal part)
df2$front <- as.numeric(gsub("([0-9]+).*$", "\\1", df2$stars)) # leading integer part
df2$star_value <- paste(df2$front, df2$back, sep = '.') # concatenate, e.g. "4.5"
# we have our data with 82 observations
library(ggplot2)
figure <- ggplot(df2) +
  theme_classic() +
  geom_point(aes(y = price, x = star_value, colour = 'rating'), size = 2) +
  scale_color_manual(name = "",
                     values = c("rating" = "red")) + # manual colour scale
  xlab("Rating Stars") + ylab("Price in Dollars") +
  ggtitle("Figure: Price vs. Rating Stars")
figure
The plot shows no detectable pattern in the relationship between price and rating stars.
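A visual impression like this can be backed up with a single number. The sketch below computes Spearman's rank correlation, which picks up any monotonic association rather than only a linear one. The values are made up for illustration and are not the scraped data.

```r
# Hypothetical values for illustration only; these are not the scraped data.
star_value <- c(3.9, 4.0, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7)
price      <- c(45, 38, 190, 77, 156, 71, 71.5, 89)

# Spearman's rho ranks both variables first, so it does not assume linearity.
rho <- cor(star_value, price, method = "spearman")
print(rho)
```

A value near zero would be consistent with the lack of pattern in the plot; `cor.test()` with the same `method` argument adds a p-value if one is wanted.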
model <- lm(price ~ factor(star_value), data = df2)
#tabulate the regression results:
library(stargazer)
stargazer(model,
          type = "text",
          align = TRUE,
          no.space = TRUE,
          title = "Simple Regression")
##
## Simple Regression
## =================================================
##                           Dependent variable:
##                       ---------------------------
##                                  price
## -------------------------------------------------
## factor(star_value)3.9           -14.000
##                                (266.250)
## factor(star_value)4.0           -21.000
##                                (230.579)
## factor(star_value)4.1           72.333
##                                (203.352)
## factor(star_value)4.2           131.900
##                                (197.456)
## factor(star_value)4.3           18.476
##                                (192.697)
## factor(star_value)4.4           97.000
##                                (193.725)
## factor(star_value)4.5           12.462
##                                (195.374)
## factor(star_value)4.6           12.300
##                                (197.456)
## factor(star_value)4.7           30.000
##                                (266.250)
## factor(star_value)4.9           40.000
##                                (266.250)
## factor(star_value)5.0           -21.000
##                                (266.250)
## Constant                        59.000
##                                (188.267)
## -------------------------------------------------
## Observations                      84
## R2                               0.067
## Adjusted R2                     -0.076
## Residual Std. Error       188.267 (df = 72)
## F Statistic              0.469 (df = 11; 72)
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
As mentioned above, higher star ratings from past purchasers are not significantly associated with the market price of the product. Also, star ratings should not be treated as a proxy for market demand. Note that this is just a simple illustrative example of web scraping and should not be considered a real analysis.