HW10

#install.packages("rvest")
#Note: The bar plot originally had no title, so I added one on 4/24.

Task 1: Warmup with (fake) Zillow

Scrape the results of this housing search

library(rvest)
library(tidyverse)
library(ggplot2)

#Specify the url
url <- "https://www.mfilipski.com/random/zillow"

#Read the html
webpage <- read_html(url)

#Extract the price of the house
price_read <- html_nodes(webpage, ".list-card-price")
price <- html_text(price_read)
head(price)

## [1] "$174,999" "$210,000" "$479,900" "$330,000" "$290,000" "$319,900"

#Extract the details of the house
details_read <- html_nodes(webpage, ".list-card-details")
details <- html_text(details_read)
head(details)

## [1] "4 bds3 ba1,524 sqft- House for sale" "3 bds2 ba1,171 sqft- House for sale"
## [3] "3 bds2 ba2,318 sqft- House for sale" "4 bds3 ba2,238 sqft- House for sale"
## [5] "4 bds3 ba2,213 sqft- House for sale" "3 bds2 ba-- sqft- House for sale"
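
As a small offline illustration of the selector-based extraction above, the same calls can be run on an inline HTML string. The markup below is a hypothetical imitation of the listing cards (only the class names `list-card-price` and `list-card-details` come from the code above; the surrounding structure is an assumption), so this is a sketch rather than a copy of the real page.

```r
library(rvest)

# Hypothetical markup: only the two class names are taken from the code above
snippet <- '<html><body>
  <div><div class="list-card-price">$174,999</div>
       <div class="list-card-details">4 bds3 ba1,524 sqft- House for sale</div></div>
  <div><div class="list-card-price">$210,000</div>
       <div class="list-card-details">3 bds2 ba1,171 sqft- House for sale</div></div>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(snippet)

# Select nodes by CSS class, then pull out their text
price_demo <- html_text(html_nodes(page, ".list-card-price"))
details_demo <- html_text(html_nodes(page, ".list-card-details"))

print(price_demo)
print(details_demo)
```

Because prices and details are extracted in two separate passes, the two vectors line up only if every card contains both elements, which is worth checking with `length()` before combining them.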

Do some cleaning

# Make the dataframe
data <- data.frame(details = details)

# Split the string that gives bedrooms, bathrooms and square footage.
data <- data %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(details, "\\d+(?=\\s*bds)"))),
         bathrooms = as.integer(str_trim(str_extract(details, "\\d+(?=\\s*ba)"))),
         sqft = str_trim(str_extract(details, "\\d+[\\d,]*(?=\\s*sqft)")))
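
The lookahead patterns used above can be checked on two of the strings shown in the `head(details)` output. This is a minimal sketch; the names `sample_details`, `beds_demo`, `baths_demo` and `sqft_demo` are made up for the illustration.

```r
library(stringr)

# Two strings copied from the head(details) output above
sample_details <- c("4 bds3 ba1,524 sqft- House for sale",
                    "3 bds2 ba-- sqft- House for sale")

# (?=...) is a lookahead: the digits are matched only when followed by the
# unit, but the unit itself is not included in the extracted text
beds_demo <- as.integer(str_extract(sample_details, "\\d+(?=\\s*bds)"))
baths_demo <- as.integer(str_extract(sample_details, "\\d+(?=\\s*ba)"))

# Square footage may contain commas; a listing shown as "--" yields NA
sqft_raw <- str_extract(sample_details, "\\d+[\\d,]*(?=\\s*sqft)")
sqft_demo <- as.numeric(str_replace_all(sqft_raw, ",", ""))

print(beds_demo)
print(baths_demo)
print(sqft_demo)
```

The `NA` produced for the "--" listing is what later causes `lm()` to drop that row for missingness.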

Visualization

#Make the square footage and price variables numeric
data$sqft <- as.numeric(str_replace_all(data$sqft, ",", ""))
data$price <- price
data$price_n <- as.numeric(str_replace_all(data$price, "[^0-9.]", ""))

#Scatter plot between price and each factor of details
par(mfrow=c(2,2))
plot(data$bedrooms, data$price_n, main="Scatterplot between price and bedrooms",
   xlab="bedrooms", ylab="price")
plot(data$bathrooms, data$price_n, main="Scatterplot between price and bathrooms",
   xlab="bathrooms", ylab="price")
plot(data$sqft, data$price_n, main="Scatterplot between price and sqft",
   xlab="sqft", ylab="price")

Regression analysis

reg <- lm(price_n ~ bedrooms + bathrooms + sqft, data = data)

summary(reg)

## 
## Call:
## lm(formula = price_n ~ bedrooms + bathrooms + sqft, data = data)
## 
## Residuals:
##      1      2      3      4      5      7      8     10 
##  20965  36651  11123  -7935 -41496  -9489 -19308   9489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -51690.34   52284.74  -0.989  0.37881    
## bedrooms     143901.86   32939.32   4.369  0.01198 *  
## bathrooms   -254137.86   46363.98  -5.481  0.00539 ** 
## sqft            257.57      21.34  12.069  0.00027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32570 on 4 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.9896, Adjusted R-squared:  0.9819 
## F-statistic: 127.3 on 3 and 4 DF,  p-value: 0.0002007

The coefficient on bedrooms is statistically significant at the 5% level (p = 0.012): each additional bedroom is associated with an average increase of $143,901.86 in price, holding the other variables constant.

The coefficient on bathrooms is significant at the 1% level (p = 0.005): each additional bathroom is associated with an average decrease of $254,137.86 in price, holding the other variables constant. This negative sign is counterintuitive and should be read with caution, since the model is fit on only 8 listings (3 were dropped for missing values).

The coefficient on square footage is significant at the 0.1% level (p = 0.0003): each additional square foot is associated with an average increase of $257.57 in price, holding the other variables constant.
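
To make the "holding other factors constant" reading concrete, the reported coefficients can be combined by hand for one listing. The sketch below uses the second scraped listing (3 bds, 2 ba, 1,171 sqft, listed at $210,000) and the coefficients as printed in the summary; because those are rounded, the result only approximately reproduces the reported residual of 36651 for observation 2. The names `coefs`, `listing`, `fitted_price` and `residual_demo` are made up for the illustration.

```r
# Coefficients copied from the summary(reg) output above (rounded as printed)
coefs <- c(intercept = -51690.34,
           bedrooms  = 143901.86,
           bathrooms = -254137.86,
           sqft      = 257.57)

# Listing 2 from the scraped output: 3 bds, 2 ba, 1,171 sqft, price $210,000
listing <- c(1, 3, 2, 1171)   # leading 1 multiplies the intercept
observed_price <- 210000

# Fitted value = intercept + sum of coefficient * regressor
fitted_price <- sum(coefs * listing)
residual_demo <- observed_price - fitted_price

print(round(fitted_price))
print(round(residual_demo))
```

The fitted price comes out at roughly $173,000, so the listing is priced about $36,600 above what the model predicts, in line with the residual table.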

Task 2: Open scraping exercise

#Specify Dr. Jeffrey M. Wooldridge's google scholar url
url2 <- "https://scholar.google.com/citations?user=faE3_ksAAAAJ&hl=en&oi=sra"

#Read the html
webpage2 <- read_html(url2)

#Extract citation
cite_read <- html_nodes(webpage2, ".gsc_a_c a")
cite <- html_text(cite_read)

#Make dataframe
cite <- data.frame(cite = cite)
cite <- subset(cite,cite != "*")

#Extract year 
year_read <- html_nodes(webpage2, ".gsc_a_y span")
year <- html_text(year_read)

#Make dataframe
year <- data.frame(year = year)
year <- subset(year,year != "Year")

#Extract the title
title_read <- html_nodes(webpage2, ".gsc_a_at")
title <- html_text(title_read)
head(title)

## [1] "Econometric analysis of cross section and panel data"                                                         
## [2] "Introductory econometrics: A modern approach"                                                                 
## [3] "Recent developments in the econometrics of program evaluation"                                                
## [4] "Econometric methods for fractional response variables with an application to 401 (k) plan participation rates"
## [5] "A capital asset pricing model with time-varying covariances"                                                  
## [6] "Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances"

#Merge and Make final dataset 
data2 <- data.frame(title = title)
data2$cite <- cite$cite
data2$year <- year$year

head(data2)

##                                                                                                           title
## 1                                                          Econometric analysis of cross section and panel data
## 2                                                                  Introductory econometrics: A modern approach
## 3                                                 Recent developments in the econometrics of program evaluation
## 4 Econometric methods for fractional response variables with an application to 401 (k) plan participation rates
## 5                                                   A capital asset pricing model with time-varying covariances
## 6             Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances
##    cite year
## 1 55788 2010
## 2 27694 2015
## 3  5918 2009
## 4  4659 1996
## 5  4648 1988
## 6  4483 1992

#make the variables numeric
data2$cite <- as.numeric(data2$cite)
data2$year <- as.numeric(data2$year)

#make bar plot (the title and axis label are passed directly to barplot)
barplot(tapply(data2$cite, data2$year, FUN=sum), las = 2, cex.names = 1,
        xlab = "year",
        main = "Citations of Dr. Wooldridge's papers on the first page of Google Scholar")

I wanted to know how often the papers of Dr. Jeffrey M. Wooldridge, a researcher I respect, have been cited. Using the first page of his Google Scholar profile, which lists his 20 most-cited works, I looked at which publication year accounts for the most citations. The work published in 2010, “Econometric Analysis of Cross Section and Panel Data,” has the highest number of citations.
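
The per-year totals behind the bar plot can be sketched on the six rows shown in the `head(data2)` output. This uses only those six rows, not the full set of 20, so it illustrates the aggregation step rather than reproducing the final figure; the names `cite_demo`, `year_demo`, `totals` and `top_year` are made up for the illustration.

```r
# First six rows copied from the head(data2) output above
cite_demo <- c(55788, 27694, 5918, 4659, 4648, 4483)
year_demo <- c(2010, 2015, 2009, 1996, 1988, 1992)

# Sum citations within each publication year; names are the years, sorted
totals <- tapply(cite_demo, year_demo, FUN = sum)

# Year with the largest citation total among these six rows
top_year <- names(which.max(totals))

print(totals)
print(top_year)

# Base graphics take the title and axis label as arguments of barplot()
barplot(totals, las = 2, xlab = "year",
        main = "Citations by publication year (first six rows only)")
```

Among these six rows, 2010 has the largest total, which matches the conclusion drawn from the full first page.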