Task 1: Warmup with (fake) Zillow

I am running a housing search. For example, take a quick look at this Zillow website. In an ideal world, Zillow would not block scrapers, and you could do the whole exercise using that link. However, Zillow is pretty good at detecting simple scrapers and blocking them, so we’re going to practice on the following static page instead of the real Zillow.

Click on this copied Zillow page. Give it a second to load. This page contains a copy of real Zillow search results from 2 years ago. The copy loses a lot of the functionality, such as the interactive map, but the essentials I need for the exercise are there: a list of homes with some pictures, prices, and characteristics. This is what we will scrape for this exercise.

Scraping the results of this housing search

Extracting the following elements from the fake page:
- Price of the house
- Details of the house (bathrooms, bedrooms, square footage, anything else you want)

library(rvest)
library(tidyverse)

zillow_url <- 'https://www.mfilipski.com/random/zillow'
zillow_html <- read_html(zillow_url)
head(zillow_html)

## $node
## <pointer: 0x107a1c6e0>
## 
## $doc
## <pointer: 0x107a1bed0>

# helper functions to get prices and house details (number of beds, baths, etc.)
# prices come from elements with the CSS class 'list-card-price'
get_price <- function(html){
  html %>%
    html_nodes('.list-card-price') %>%
    html_text() %>% 
    str_trim()
  }

get_details <- function(html){
  html %>%
    html_nodes('.list-card-details') %>%
    html_text() %>% 
    str_trim()
  }

price <- get_price(zillow_html)
price <- data.frame(price)
detail <- get_details(zillow_html)
detail <- data.frame(detail)
#There are altogether 11 observations.
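
To see what the two helper functions are doing, here is a minimal sketch that runs the same class-selector logic on a tiny hand-written HTML snippet instead of the live page. The snippet, its two listings, and the names `toy_html`, `get_text_by_class` and `toy_listings` are made up for illustration; only the class names `list-card-price` and `list-card-details` come from the code above.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the scraped page: two listing cards
toy_html <- read_html('
  <div class="list-card">
    <div class="list-card-price"> $250,000 </div>
    <div class="list-card-details">3 bds 2 ba 1,500 sqft</div>
  </div>
  <div class="list-card">
    <div class="list-card-price"> $180,000 </div>
    <div class="list-card-details">2 bds 1 ba 900 sqft</div>
  </div>')

# One generic helper: select by CSS class, pull the text, trim whitespace
get_text_by_class <- function(html, css_class) {
  html %>%
    html_nodes(paste0(".", css_class)) %>%
    html_text() %>%
    str_trim()
}

toy_prices  <- get_text_by_class(toy_html, "list-card-price")
toy_details <- get_text_by_class(toy_html, "list-card-details")

# Combine the two character vectors into one data frame, one row per listing
toy_listings <- data.frame(price = toy_prices, detail = toy_details)
print(toy_listings)
```

A leading dot in the selector means "class", so `.list-card-price` matches every element carrying that class regardless of its tag.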

Some cleaning

The details string has three components (bedrooms, bathrooms and square footage) packed into one piece of text, so we need to split it into separate data points. This involves some use of regular expressions. The cleaning code uses lookahead patterns: for example, the bedroom regex extracts whatever digits are located in front of the string ‘bd’.
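
The lookahead idea can be tried on a few made-up strings first. This is only a sketch: the strings in `toy_detail` are invented, they assume the thousands separators have already been removed, and the pattern `\\d+(?=\\s*bd)` is a slightly stricter variant of the one used in the cleaning code, not the same expression.

```r
library(stringr)

# Made-up detail strings in the same "bds / ba / sqft" style
toy_detail <- c("3 bds 2 ba 1500 sqft", "2 bds 1 ba 900 sqft")

# (?=...) is a lookahead: the text must follow the digits,
# but it is not included in what str_extract() returns
beds  <- as.integer(str_extract(toy_detail, "\\d+(?=\\s*bd)"))
baths <- as.integer(str_extract(toy_detail, "\\d+(?=\\s*ba)"))
sqft  <- as.integer(str_extract(toy_detail, "\\d+(?=\\s*sqft)"))

data.frame(beds, baths, sqft)
```

Because the unit itself is excluded from the match, the result can go straight into `as.integer()` without further trimming.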

library(dplyr)
library(gsubfn)
df <- cbind(price, detail)

#remove $ and comma from every column (this also strips thousands separators in the details)
df[]<-lapply(df,gsub,pattern="$",fixed=TRUE,replacement="")
df[]<-lapply(df,gsub,pattern=",",fixed=TRUE,replacement="")

#prices are still text at this point; convert them to numbers for the plot and the regression
df$price <- as.numeric(df$price)

#use lookahead regular expressions on 'bd', 'ba' and 'sqft' to split the details
df <- df %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=bd)"))),
         bathrooms = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=ba)"))),
         sqft = as.integer(str_trim(str_extract(detail, "[\\d ]*(?=sqft)"))))

#also create the new variable: house or condo
df$house_condo <- str_extract(df$detail, pattern = "House")
df$house_condo[is.na(df$house_condo)] <- "Condo"

#create dummy: house=1 and 0 otherwise
df$house <- ifelse(df$house_condo == 'House', 1, 0)

# we have the data we need for analysis

Visualization

library(ggplot2)
figure <- ggplot(df, aes(x=sqft, y=price)) + 
  theme_classic()+
  geom_point(colour="red", size = 2)+          #red points
  xlab("Square Feet") + ylab("Price in Dollars")+
  ggtitle("Figure: Price vs. Square Feet")

figure

As square footage increases, the price of the house increases, as expected. The relationship looks roughly linear. Let’s see if we can estimate the marginal effect of each additional square foot.

Regression analysis

Simple OLS command to run the following “hedonic pricing” model: price ~ number of bedrooms, number of bathrooms, square footage (and any other variable we have scraped, if any)

# regression 
model <- lm(price ~ bedrooms + bathrooms + sqft + house, data = df)

#tabulate the regression results:
library(stargazer)
stargazer(model,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("Number of Bedrooms", "Number of Bathrooms", "Sq. Foot", "House=1"),
           title="Hedonic Analysis")

## 
## Hedonic Analysis
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                price           
## -----------------------------------------------
## Number of Bedrooms         99,979.210**        
##                            (24,150.180)        
## Number of Bathrooms       -228,336.400***      
##                            (39,405.040)        
## Sq. Foot                    255.082***         
##                              (18.491)          
## House=1                     55,339.140*        
##                            (24,150.280)        
## Constant                    -4,007.755         
##                            (30,381.080)        
## -----------------------------------------------
## Observations                     9             
## R2                             0.993           
## Adjusted R2                    0.986           
## Residual Std. Error     28,155.600 (df = 4)    
## F Statistic           138.470*** (df = 4; 4)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Comment: All coefficients except the one on bathrooms have the expected sign. Price increases with the number of bedrooms and with square footage, i.e., the larger the house, the higher the price, which makes sense. More specifically, one additional bedroom raises the price by $99,979 (significant at the 5 percent level), and each additional square foot raises the price by $255. Houses are also more expensive than condominiums, by about $55,339. The bathroom coefficient is negative, which is counterintuitive. There are only 9 observations because this is just a simple web scraping exercise, so these estimates should not be over-interpreted.
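
As a small worked example of reading the table, the sketch below copies the point estimates from the output above and computes two things: the implied price change for a hypothetical 100 extra square feet, and the residual degrees of freedom. The names `coefs`, `extra_sqft`, `delta_price` and `resid_df` are introduced here only for illustration.

```r
# Point estimates copied from the regression table above
coefs <- c(bedrooms = 99979.210, bathrooms = -228336.400,
           sqft = 255.082, house = 55339.140)

# Implied price change for 100 extra square feet, other regressors held fixed
extra_sqft <- 100
delta_price <- unname(coefs["sqft"] * extra_sqft)
delta_price   # about 25,508 dollars

# Residual degrees of freedom: observations minus estimated parameters
# (4 slopes plus the intercept)
n_obs <- 9
resid_df <- n_obs - (length(coefs) + 1)
resid_df      # 4, matching "df = 4" in the table
```

With only 4 residual degrees of freedom, five parameters are being estimated from nine listings, which is one reason to treat the very high R2 with caution.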

Task 2: Open scraping exercise

Scraping data from any existing publicly-available website, and running a simple analysis.

I am wondering whether goods with higher ratings also have higher prices. I will need information on consumer ratings and the market price of the good. Let’s use Amazon’s search listing for chairs to scrape the data and analyze it.

# Scraping data from Amazon
library(rvest)
library(tidyverse)
amazon_url <- 'https://www.amazon.com/s?k=chair&crid=DG992UODKKDY&sprefix=chair%2Caps%2C110&ref=nb_sb_noss_1'

amazon_html <- read_html(amazon_url)
head(amazon_html)

## $node
## <pointer: 0x10e40dfc0>
## 
## $doc
## <pointer: 0x10e40dad0>

# helper functions to get prices, Prime icons and star ratings
# prices come from elements with the CSS class 'a-price-whole'
get_price_amazon <- function(html){
  html %>%
    html_nodes('.a-price-whole') %>%
    html_text() %>% 
    str_trim()
  }

get_primeicon <- function(html){
  html %>%
    # chain the classes with dots (no spaces) to match one element carrying all three
    html_nodes('.a-icon.a-icon-prime.a-icon-medium') %>%
    html_text() %>% 
    str_trim()
  }

get_stars <- function(html){
  html %>%
    html_nodes('.a-icon-alt') %>%
    html_text() %>% 
    str_trim()
  }


price <- get_price_amazon(amazon_html)
price <- data.frame(price)

stars <- get_stars(amazon_html)
stars <- data.frame(stars)
stars <- head(stars, -6)

#combine into a dataset
df2 <- cbind(price, stars)
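
Binding two separately scraped vectors side by side relies on both having the same length and order. A more robust pattern, sketched below on a hand-written snippet, is to select each result card first and then look for the price and the rating inside that card. The markup and the class name `result-card` are hypothetical and not Amazon's actual structure; only `a-price-whole` and `a-icon-alt` come from the code above.

```r
library(rvest)

# Hypothetical search-result markup; the middle card has no rating
toy_page <- read_html('
  <div class="result-card">
    <span class="a-price-whole">59.</span>
    <span class="a-icon-alt">4.5 out of 5 stars</span>
  </div>
  <div class="result-card">
    <span class="a-price-whole">120.</span>
  </div>
  <div class="result-card">
    <span class="a-price-whole">35.</span>
    <span class="a-icon-alt">3.9 out of 5 stars</span>
  </div>')

# First select the cards, then search inside each card.
# html_node() (singular) returns one result per card, NA where nothing matches.
cards <- html_nodes(toy_page, ".result-card")
toy_df <- data.frame(
  price = html_text(html_node(cards, ".a-price-whole")),
  stars = html_text(html_node(cards, ".a-icon-alt"))
)
toy_df
```

The card without a rating gets `NA` in `stars` instead of shifting every later rating up by one row, so prices and ratings stay matched without trimming either vector by hand.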

Repeating a similar cleaning exercise as above:

library(dplyr)
library(gsubfn)

#remove "." and "," from prices only (the ratings need their decimal point)
df2$price <- gsub(".", "", df2$price, fixed=TRUE)
df2$price <- gsub(",", "", df2$price, fixed=TRUE)
df2$price <- as.numeric(df2$price)

#use the word 'out' as an anchor to split the rating text into its digits
df2 <- df2 %>%
  mutate(back = as.integer(str_trim(str_extract(stars, "[\\d ]*(?=out)"))))
df2$front <- as.numeric(gsub("([0-9]+).*$", "\\1", df2$stars))
df2$star_value <- paste(df2$front, df2$back, sep = '.', collapse=NULL)  #concatenate

# we have our data with 82 observations
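
The front/back split above can also be written as a single extraction. The sketch below is an alternative, not the method used in this report: it assumes every rating string starts with the number, as in "4.5 out of 5 stars", and the strings in `toy_stars` are made up.

```r
library(stringr)

# Made-up rating strings in the "X out of 5 stars" format
toy_stars <- c("4.5 out of 5 stars", "4.0 out of 5 stars", "5 out of 5 stars")

# Take the leading number, with an optional decimal part, and convert to numeric
star_num <- as.numeric(str_extract(toy_stars, "^\\d+(\\.\\d+)?"))
star_num
```

A numeric rating like `star_num` could then enter a regression as a single slope rather than as one dummy per rating level.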

Visualization and Simple Regression:

library(ggplot2)
figure <- ggplot(df2, aes(x=star_value, y=price)) + 
  theme_classic()+
  geom_point(colour="red", size = 2)+          #red points
  xlab("Rating Stars") + ylab("Price in Dollars")+
  ggtitle("Figure: Price vs. Rating Stars")

figure

The plot shows no detectable relationship between price and rating stars.

model <- lm(price ~ factor(star_value), data = df2)

#tabulate the regression results:
library(stargazer)
stargazer(model,
           type="text",
           align=TRUE,
           no.space=TRUE,
           title="Simple Regression")

## 
## Simple Regression
## =================================================
##                           Dependent variable:    
##                       ---------------------------
##                                  price           
## -------------------------------------------------
## factor(star_value)3.9           -14.000          
##                                (266.250)         
## factor(star_value)4.0           -21.000          
##                                (230.579)         
## factor(star_value)4.1           72.333           
##                                (203.352)         
## factor(star_value)4.2           131.900          
##                                (197.456)         
## factor(star_value)4.3           18.476           
##                                (192.697)         
## factor(star_value)4.4           97.000           
##                                (193.725)         
## factor(star_value)4.5           12.462           
##                                (195.374)         
## factor(star_value)4.6           12.300           
##                                (197.456)         
## factor(star_value)4.7           30.000           
##                                (266.250)         
## factor(star_value)4.9           40.000           
##                                (266.250)         
## factor(star_value)5.0           -21.000          
##                                (266.250)         
## Constant                        59.000           
##                                (188.267)         
## -------------------------------------------------
## Observations                      84             
## R2                               0.067           
## Adjusted R2                     -0.076           
## Residual Std. Error        188.267 (df = 72)     
## F Statistic               0.469 (df = 11; 72)    
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01

Consistent with the plot, higher star ratings from past purchasers are not significantly associated with the market price of the product: none of the rating coefficients is statistically significant, and the F statistic is only 0.469. Also, rating stars should not be thought of as a proxy for market demand. Note that this is just a simple illustrative example of web scraping; it should not be considered a real analysis.

HW10, 8610 - Web Scraping

metricsdawg

2023-04-21