# Task 1: Warmup with (fake) Zillow

Pretend you’re running a housing search

Take a quick look at this Zillow search (1bd-1ba in Athens, GA). In an ideal world, Zillow would not be blocking scrapers, and you could do the entire exercise using that link. However, Zillow is pretty good at detecting simple scrapes and blocking them, so you’re going to practice on the following static page instead of the real Zillow.

Click on this copied Zillow page to open it in your browser (give it a second to load). The page contains a copy of home search results from 2 years ago. Those were real Zillow search results; I simply copied them onto my personal website. This loses a lot of the functionality, such as the interactive map (which I cannot copy from the Zillow servers), but the essentials you need for the exercise are there: a list of homes with some pictures, prices, and characteristics. This is what you will scrape.

Scrape the results of this housing search

Extract the following elements from the fake page:

- Price of the house
- Details of the house (bathrooms, bedrooms, square footage, anything else you want)

Hint: this is not hard. A few simple commands from the “rvest” package should do the trick.

library(rvest)
library(tidyverse)
library(ggplot2)
library(XML)
library(xml2)
library(jsonlite)
library(stringr)
library(haven)
library(dplyr)
library(reshape2)


get_price <- function(html){
  html%>%
    #the relevant tag
    html_nodes('.list-card-price')%>%
    html_text()%>%
    str_trim()}

get_details <- function(html){
  html%>%
    #the relevant tag
    html_nodes('.list-card-details')%>%
    html_text()%>%
    str_trim()}

house_url <- 'https://www.mfilipski.com/random/zillow.html'
housing <- read_html(house_url)
head(housing)
## $node
## <pointer: 0x112f1a420>
## 
## $doc
## <pointer: 0x112f19c10>
price <- get_price(housing)
details <- get_details(housing)

length(price); length(details)
## [1] 11
## [1] 11
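
To see how the class selectors above pick out text, here is a minimal offline sketch. The two-card HTML snippet is invented for illustration (only the class names `.list-card-price` and `.list-card-details` come from the code above), so it is an assumption about the page's structure rather than a copy of it.

```r
library(rvest)

# Hypothetical markup: two listing cards using the same class names as above.
page <- minimal_html('
  <div class="card">
    <div class="list-card-price">$174,999</div>
    <ul class="list-card-details"><li>4 bds</li><li>3 ba</li><li>1,524 sqft</li></ul>
  </div>
  <div class="card">
    <div class="list-card-price">$150,000</div>
    <ul class="list-card-details"><li>1 bd</li><li>1 ba</li><li>867 sqft</li></ul>
  </div>
')

# Same pattern as get_price() and get_details(): select by class, then take the text.
prices  <- page %>% html_nodes(".list-card-price")   %>% html_text() %>% trimws()
details <- page %>% html_nodes(".list-card-details") %>% html_text() %>% trimws()

prices
details
```

Because `html_text()` concatenates the text of all child elements, the separate list items come back glued together (for example `"4 bds3 ba1,524 sqft"`), which is why the details string needs to be split in the next step.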

Do some cleaning

You will need to split the string that gives bathrooms, bedrooms and square footage.

Note: This might involve some use of Regular Expressions. To avoid a student uprising, here is a hint:

mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)"))))

The above is a regex that extracts whatever digits are located in front of the string ‘bds’.
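
One caveat: a pattern that looks ahead for ‘bds’ will not match a listing that says ‘1 bd’ (singular). The following is a sketch of one possible variant, not the hint's intended solution; the three sample strings are copied from the scraped details, and the patterns assume a single space between each number and its unit.

```r
library(stringr)

# Three detail strings as they appear in the scraped data.
details <- c("4 bds3 ba1,524 sqft- House for sale",
             "1 bd1 ba867 sqft- House for sale",
             "3 bds2 ba-- sqft- House for sale")

# Digits directly before " bd" or " bds" (the trailing "s" is optional).
bedrooms <- as.integer(str_extract(details, "\\d+(?= bds?)"))

# Digits directly before " ba".
bathrooms <- as.integer(str_extract(details, "\\d+(?= ba)"))

# Digits and commas directly before " sqft"; "--" yields NA because nothing matches.
sqft <- as.numeric(str_remove_all(str_extract(details, "[\\d,]+(?= sqft)"), ","))

bedrooms   # 4 1 3
bathrooms  # 3 1 2
sqft       # 1524 867 NA
```

Requiring at least one digit (`\\d+`) rather than allowing an empty match means missing values come back as `NA` directly, without relying on `as.integer("")`.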

tab_housing <- tibble(price, details)
tab_housing
## # A tibble: 11 × 2
##    price    details                            
##    <chr>    <chr>                              
##  1 $174,999 4 bds3 ba1,524 sqft- House for sale
##  2 $210,000 3 bds2 ba1,171 sqft- House for sale
##  3 $479,900 3 bds2 ba2,318 sqft- House for sale
##  4 $330,000 4 bds3 ba2,238 sqft- House for sale
##  5 $290,000 4 bds3 ba2,213 sqft- House for sale
##  6 $319,900 3 bds2 ba-- sqft- House for sale   
##  7 $115,000 2 bds2 ba1,540 sqft- Condo for sale
##  8 $259,900 3 bds2 ba1,582 sqft- Condo for sale
##  9 $225,000 2 bds3 ba-- sqft- Condo for sale   
## 10 $875,000 5 bds5 ba5,701 sqft- House for sale
## 11 $150,000 1 bd1 ba867 sqft- House for sale
housing_2 <- tab_housing %>%
  mutate(bathrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))) 

housing_2 <- housing_2  %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)")))) 

housing_2 <- housing_2 %>%
  mutate(sqft = str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))) 

housing_2 <- housing_2 %>%
  mutate(sqft = as.numeric(str_replace(sqft,",",""))) 

housing_2 <- housing_2 %>%
  mutate(price = as.numeric(str_replace_all(price,"[^0-9]*","")))

# Dummy variable: 1 if the listing is a house, 0 if it is a condo (hard-coded from the details column above).
house <- c(1,1,1,1,1,1,0,0,0,1,1)

house_athens <- tibble(house,housing_2)
house_athens <- subset(house_athens, select = -c(details))
house_athens
## # A tibble: 11 × 5
##    house  price bathrooms bedrooms  sqft
##    <dbl>  <dbl>     <int>    <int> <dbl>
##  1     1 174999         3        4  1524
##  2     1 210000         2        3  1171
##  3     1 479900         2        3  2318
##  4     1 330000         3        4  2238
##  5     1 290000         3        4  2213
##  6     1 319900         2        3    NA
##  7     0 115000         2        2  1540
##  8     0 259900         2        3  1582
##  9     0 225000         3        2    NA
## 10     1 875000         5        5  5701
## 11     1 150000         1       NA   867
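
The house dummy could also be derived from the details text instead of being typed in by hand. This is a sketch of that alternative, assuming every listing ends in either "House for sale" or "Condo for sale" as in the table above; the shortened strings below keep only that ending.

```r
library(stringr)

# Listing types in the same order as the scraped table (rows 1 to 11).
details <- c(rep("- House for sale", 6),
             rep("- Condo for sale", 3),
             rep("- House for sale", 2))

# Pull out the property type, then turn it into a 0/1 indicator.
type  <- str_extract(details, "(House|Condo)(?= for sale)")
house <- as.integer(type == "House")

house
```

This reproduces the hard-coded vector `c(1,1,1,1,1,1,0,0,0,1,1)` and would keep working if the page returned the listings in a different order.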

Visualization

Make any plot(s) you think are informative about this data

boxplot(house_athens$price, main="Price")

boxplot(house_athens$sqft, main="Square Footage")
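
A scatter plot of price against square footage is another option, since it shows the relationship the regression below tries to capture. This is only a sketch: the data frame re-enters the values printed in the table above so that the block runs on its own, and the axis labels and colour choice are illustrative.

```r
library(ggplot2)

# Values copied from the house_athens table above.
house_athens <- data.frame(
  house = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  price = c(174999, 210000, 479900, 330000, 290000, 319900,
            115000, 259900, 225000, 875000, 150000),
  sqft  = c(1524, 1171, 2318, 2238, 2213, NA, 1540, 1582, NA, 5701, 867)
)

# Drop listings without square footage so ggplot does not warn about missing points.
plot_df <- subset(house_athens, !is.na(sqft))

p <- ggplot(plot_df, aes(x = sqft, y = price,
                         color = factor(house, labels = c("Condo", "House")))) +
  geom_point(size = 3) +
  labs(x = "Square footage", y = "Price (USD)", color = "Type",
       title = "Price vs. square footage")

print(p)
```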

Regression analysis

Run a simple OLS command to estimate the following “hedonic pricing” model:

price ~ number of bedrooms, number of bathrooms, square footage (and any other variable you might have scraped, if any)

Briefly comment on those results.

reg_house <- lm(price ~ bedrooms + bathrooms + sqft, house_athens)
summary(reg_house)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft, data = house_athens)
## 
## Residuals:
##      1      2      3      4      5      7      8     10 
##  20965  36651  11123  -7935 -41496  -9489 -19308   9489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -51690.34   52284.74  -0.989  0.37881    
## bedrooms     143901.86   32939.32   4.369  0.01198 *  
## bathrooms   -254137.86   46363.98  -5.481  0.00539 ** 
## sqft            257.57      21.34  12.069  0.00027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32570 on 4 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.9896, Adjusted R-squared:  0.9819 
## F-statistic: 127.3 on 3 and 4 DF,  p-value: 0.0002007

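The results indicate that, holding the other variables constant, an additional bedroom is associated with a $143,902 higher sale price (significant at the 5% level), an additional bathroom with a $254,138 lower price (significant at the 1% level), and an additional square foot with a $257.57 higher price (significant at the 0.1% level). However, only 8 of the 11 listings enter the regression: two have no square footage, and one has no bedroom count because the ‘bds’ pattern does not match ‘1 bd’. That leaves 4 residual degrees of freedom, so these estimates are unlikely to represent actual real estate prices. The negative bathroom coefficient in particular is implausible and probably reflects the tiny sample and the close relationship between bedroom and bathroom counts.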
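
As a rough check on how much weight these estimates can bear, the sketch below refits the same model on the values printed in the table above, then looks at the confidence intervals and at how strongly the regressors move together. It is an illustrative diagnostic, not part of the assignment.

```r
# Values copied from the house_athens table above (NA where the table shows NA).
house_athens <- data.frame(
  price     = c(174999, 210000, 479900, 330000, 290000, 319900,
                115000, 259900, 225000, 875000, 150000),
  bathrooms = c(3, 2, 2, 3, 3, 2, 2, 2, 3, 5, 1),
  bedrooms  = c(4, 3, 3, 4, 4, 3, 2, 3, 2, 5, NA),
  sqft      = c(1524, 1171, 2318, 2238, 2213, NA, 1540, 1582, NA, 5701, 867)
)

fit <- lm(price ~ bedrooms + bathrooms + sqft, data = house_athens)

# Rows with any missing value are dropped, so only complete listings are used.
nobs(fit)
df.residual(fit)

# 95% confidence intervals for the coefficients.
confint(fit)

# Pairwise correlations among the regressors in the estimation sample.
cor(model.frame(fit)[, c("bedrooms", "bathrooms", "sqft")])
```

High correlations between bedrooms, bathrooms and square footage would make the individual coefficients hard to separate, which is one plausible reading of the negative bathroom estimate.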


# Task 2: Open scraping exercise

Scrape any data from any existing publicly-available website, and run a simple analysis. Any website and analysis are ok, as long as:

• You use actual scraping (not an API, not a simple download). This means you have to parse the data out of HTML code, or something equivalent.

• You gather enough data to generate a plot, or a table, or some summary stats, or something of that effect.

• You don’t do anything illegal (e.g., don’t try to scrape the CIA website or gather data that is not public).

As always, please show your code. Note: You do not need to scrape large amounts of data. This is just for hands-on practice, so if you can scrape a couple of variables and a handful of observations, that is enough. Still, nicer analyses based on more interesting data will get better grades.

# I scraped data about the top 250 TV series from IMDB. 
imdb_url <- read_html("https://www.imdb.com/chart/toptv/")

get_title <- html_nodes(imdb_url,'.titleColumn a') %>% 
  html_text()

get_year <- html_nodes(imdb_url, '.secondaryInfo') %>% 
  html_text()

get_rate <- html_nodes(imdb_url, '.imdbRating') %>% 
  html_text()

head(get_title); head(get_year); head(get_rate)
## [1] "Planet Earth II"  "Breaking Bad"     "Planet Earth"     "Band of Brothers"
## [5] "Chernobyl"        "The Wire"
## [1] "(2016)" "(2008)" "(2006)" "(2001)" "(2019)" "(2002)"
## [1] "\n            9.4\n    " "\n            9.4\n    "
## [3] "\n            9.4\n    " "\n            9.4\n    "
## [5] "\n            9.3\n    " "\n            9.3\n    "
title <- get_title %>% unlist()

year <- get_year %>% 
  str_replace_all(pattern = "[\\(\\)]", replacement = "") %>% 
  unlist() %>% 
  as.numeric()

rating <- get_rate %>% 
  str_replace_all(pattern = "\n", replacement = "") %>% 
  str_trim(side = "both") %>% 
  unlist() %>% 
  as.numeric()

toptv <- tibble(title,year,rating)
head(toptv)
## # A tibble: 6 × 3
##   title             year rating
##   <chr>            <dbl>  <dbl>
## 1 Planet Earth II   2016    9.4
## 2 Breaking Bad      2008    9.4
## 3 Planet Earth      2006    9.4
## 4 Band of Brothers  2001    9.4
## 5 Chernobyl         2019    9.3
## 6 The Wire          2002    9.3
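
The year and rating cleaning can also be written as a single extraction per column. This is a sketch of that alternative on sample strings copied from the output above; it assumes each year is four digits and each rating is a plain decimal number.

```r
library(stringr)

# Raw strings in the form returned by html_text() above.
raw_year <- c("(2016)", "(2008)")
raw_rate <- c("\n            9.4\n    ", "\n            9.3\n    ")

# Pull out the number directly instead of stripping everything around it.
year   <- as.integer(str_extract(raw_year, "\\d{4}"))
rating <- as.numeric(str_extract(raw_rate, "\\d+\\.?\\d*"))

# Guard: the columns must line up before they are combined into a table.
stopifnot(length(year) == length(rating))

year    # 2016 2008
rating  # 9.4 9.3
```

The length check is worth keeping in a real scrape: if a selector stops matching after a page redesign, the vectors come back empty or with unequal lengths, and the mismatch is caught before the table is built.
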
# Keep series released from 2000 through 2022.
toptv_2000 <- subset(toptv, year > 1999 & year < 2023)

plot_tv <- ggplot(data = toptv_2000, mapping = aes(x = year)) + 
  geom_histogram (fill = "steelblue2", color = "black", binwidth = 2) + 
  labs(x = "Year", y = "Count", title = "Top 250 TV Series from 2000 to 2022",
       caption = "Data source: https://www.imdb.com/chart/toptv/") + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 12),
        axis.text.x=element_text(size = 12), 
        axis.text.y=element_text(size = 12), 
        axis.title = element_text(size = 15))
print(plot_tv)