Task 1: Warmup with (fake) Zillow

# loading required packages 
library(rvest)
library(dplyr)
library(stringr)

# Set the website URL
url <- "https://www.mfilipski.com/random/zillow"

# Read the HTML content of the website
web_content <- read_html(url)

# Scrape the data
prices <- web_content %>% html_nodes(".list-card-price") 
House_details <- web_content %>% html_nodes(".list-card-details") 

# creating a data frame
details_frame<-html_text(House_details)%>%
  data.frame(t(sapply(House_details, unlist)))

head(details_frame)

##                                     .                      node
## 1 4 bds3 ba1,524 sqft- House for sale <pointer: 0x7fc01b0abd90>
## 2 3 bds2 ba1,171 sqft- House for sale <pointer: 0x7fc01b0b1550>
## 3 3 bds2 ba2,318 sqft- House for sale <pointer: 0x7fc01b1673e0>
## 4 4 bds3 ba2,238 sqft- House for sale <pointer: 0x7fc01b16d440>
## 5 4 bds3 ba2,213 sqft- House for sale <pointer: 0x7fc01b1726a0>
## 6    3 bds2 ba-- sqft- House for sale <pointer: 0x7fc01b1778e0>
##                         doc
## 1 <pointer: 0x7fc01b14c7b0>
## 2 <pointer: 0x7fc01b14c7b0>
## 3 <pointer: 0x7fc01b14c7b0>
## 4 <pointer: 0x7fc01b14c7b0>
## 5 <pointer: 0x7fc01b14c7b0>
## 6 <pointer: 0x7fc01b14c7b0>

# Parse bedrooms, bathrooms, and square footage out of the details text.
# Listings that show "--" for square footage have no digits to match and yield NA.
details_text <- html_text(House_details)
clean_df <- details_frame %>%
  mutate(
    bedrooms = as.integer(str_trim(str_extract(details_text, "[\\d ]*(?=bds)"))),
    bathrooms = as.integer(str_trim(str_extract(details_text, "\\d+(?=\\s*ba)"))),
    sqft = str_trim(str_extract(details_text, "\\d+[\\d,]*(?=\\s*sqft)")),
    sqft = as.numeric(str_replace_all(sqft, ",", ""))  # drop every thousands separator
  )

# Extract the price text from the nodes, then strip "$" and thousands separators
price_num <- as.numeric(gsub("[^0-9.]+", "", html_text(prices)))
clean_df$Price <- price_num  # add the cleaned price column to the data frame

clean_df <- select(clean_df, -., -node, -doc) # drop the raw text and pointer columns
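The parsing step can be checked offline on a couple of strings before running it on the whole page. The sketch below reuses the same regular expressions on two detail strings copied from the printed output above. The price strings are made up for illustration, since the raw price text is not shown in this report.

```r
library(stringr)

# Detail strings copied from the head(details_frame) output above
details_text <- c(
  "4 bds3 ba1,524 sqft- House for sale",
  "3 bds2 ba-- sqft- House for sale"
)
# Hypothetical price strings, made up for illustration only
price_text <- c("$450,000", "$1,250,000")

# Lookaheads pick the number that sits right before each unit label
bedrooms  <- as.integer(str_trim(str_extract(details_text, "[\\d ]*(?=bds)")))
bathrooms <- as.integer(str_extract(details_text, "\\d+(?=\\s*ba)"))
sqft_raw  <- str_extract(details_text, "\\d+[\\d,]*(?=\\s*sqft)")
sqft      <- as.numeric(str_replace_all(sqft_raw, ",", ""))

# Keep digits and dots only, which removes "$" and every comma
price <- as.numeric(gsub("[^0-9.]+", "", price_text))

parsed <- data.frame(bedrooms, bathrooms, sqft, price)
print(parsed)
```

The second listing has "--" in place of a number, so its square footage comes back as NA rather than raising an error. Rows like that are dropped later by both `ggplot()` and `lm()`.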

Visualization

The graph below is a scatterplot with square footage on the x-axis and price on the y-axis. Each point represents a property, and its position shows that property's square footage and price. This gives an idea of the relationship between the two variables, such as whether larger properties tend to be more expensive. The linear fit suggests that they do.

library(ggplot2)

# Create a scatterplot of Price vs. sqft
ggplot(data = clean_df, aes(x = sqft, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)+
  labs(x = "Square footage", y = "Price")

Regression analysis

# Regress price on bedrooms, bathrooms, and square footage (OLS)
model <- lm(Price ~ bedrooms + bathrooms + sqft, data = clean_df)

# load the stargazer package
library(stargazer)

stargazer(model, type="text",
          no.space=TRUE, keep.stat = c("n","rsq"),
          title = "OLS Regression ",
          covariate.labels = c("bedrooms", "bathrooms", "Squarefoot"), dep.var.labels = "Price")

## 
## OLS Regression
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         Price           
## ----------------------------------------
## bedrooms            143,901.900**       
##                     (32,939.320)        
## bathrooms          -254,137.900***      
##                     (46,363.980)        
## Squarefoot           257.566***         
##                       (21.341)          
## Constant             -51,690.340        
##                     (52,284.740)        
## ----------------------------------------
## Observations              8             
## R2                      0.990           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

The results in the table above suggest that the number of bedrooms has a positive and significant association with the price of a property (p < 0.05). Surprisingly, the number of bathrooms has a negative and significant coefficient (p < 0.01), which may not reflect how the market works. In the rows printed earlier, bedrooms and bathrooms move together almost one for one, so the model may have trouble separating their contributions. More importantly, the sample is very small: with only 8 observations and four estimated coefficients, 4 residual degrees of freedom remain, so inference may not be valid in this scenario.
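To make the small-sample concern concrete, here is a minimal sketch of two quick checks: the residual degrees of freedom and the correlation between the two room counts. The data frame `toy` is made up for illustration and does not contain the scraped values. The same calls can be run on `model` and `clean_df` instead.

```r
# Made-up listings for illustration only; these are NOT the scraped values
toy <- data.frame(
  Price     = c(310000, 255000, 520000, 480000, 470000, 330000, 640000, 290000),
  bedrooms  = c(4, 3, 3, 4, 4, 3, 5, 2),
  bathrooms = c(3, 2, 2, 3, 3, 1, 4, 2),
  sqft      = c(1524, 1171, 2318, 2238, 2213, 1600, 2900, 1350)
)

toy_model <- lm(Price ~ bedrooms + bathrooms + sqft, data = toy)

# Residual degrees of freedom = observations minus estimated coefficients
n_obs    <- nrow(toy)
n_coef   <- length(coef(toy_model))
resid_df <- df.residual(toy_model)

# A correlation near 1 means the two room counts carry almost the same information
room_cor <- cor(toy$bedrooms, toy$bathrooms)

cat("observations:", n_obs, " coefficients:", n_coef, " residual df:", resid_df, "\n")
cat("cor(bedrooms, bathrooms):", round(room_cor, 2), "\n")
```

With so few residual degrees of freedom, standard errors and significance stars are fragile, and a single listing can change the sign of a coefficient.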

Task 2: Open scraping exercise

In this task, I scrape the top 10 movies on the IMDb website along with their ratings, create a data frame, and then plot a bar chart showing the rating of each movie.

# Load required packages
library(rvest)
library(dplyr)
library(ggplot2)

# Set the URL to scrape
url <- "https://www.imdb.com/chart/top/"

# Read the HTML content from the URL
web_content <- read_html(url)

# Scrape the movie titles and ratings
movie_titles <- web_content %>% html_nodes(".titleColumn a") %>% html_text(trim = TRUE)
ratings <- web_content %>% html_nodes(".ratingColumn strong") %>% html_text(trim = TRUE) %>% as.numeric()

# Keep only the top 10 movies
movie_titles <- movie_titles[1:10]
ratings <- ratings[1:10]

# Create a data frame
top_movies <- data.frame(Title = movie_titles, Rating = ratings)

# Print the data frame
print(top_movies)

##                                                Title Rating
## 1                           The Shawshank Redemption    9.2
## 2                                      The Godfather    9.2
## 3                                    The Dark Knight    9.0
## 4                              The Godfather Part II    9.0
## 5                                       12 Angry Men    9.0
## 6                                   Schindler's List    8.9
## 7      The Lord of the Rings: The Return of the King    8.9
## 8                                       Pulp Fiction    8.8
## 9  The Lord of the Rings: The Fellowship of the Ring    8.8
## 10                    The Good, the Bad and the Ugly    8.8
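CSS selectors such as `.titleColumn a` depend on the page's markup at the time of scraping, and a selector that matches nothing returns an empty vector without raising an error. The sketch below shows a way to test the selectors and guard against that, using a small inline HTML snippet. The snippet is hypothetical markup written to mimic the class names used above, not a copy of the IMDb page.

```r
library(rvest)

# Hypothetical markup that mimics the class names targeted by the selectors above
toy_page <- read_html('<table>
    <tr><td class="titleColumn"><a>Movie A</a></td>
        <td class="ratingColumn"><strong>9.2</strong></td></tr>
    <tr><td class="titleColumn"><a>Movie B</a></td>
        <td class="ratingColumn"><strong>8.8</strong></td></tr>
  </table>')

toy_titles <- toy_page %>%
  html_nodes(".titleColumn a") %>%
  html_text(trim = TRUE)

toy_ratings <- toy_page %>%
  html_nodes(".ratingColumn strong") %>%
  html_text(trim = TRUE) %>%
  as.numeric()

# Stop early if a selector matched nothing or the two vectors differ in length
stopifnot(length(toy_titles) > 0, length(toy_titles) == length(toy_ratings))

toy_movies <- data.frame(Title = toy_titles, Rating = toy_ratings)
print(toy_movies)
```

Placing the same `stopifnot()` check between the scraping step and `data.frame()` in the real script would give a clear failure if the site's layout changes.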

# Plot the data
ggplot(top_movies, aes(x = reorder(Title, Rating), y = Rating, fill = Title)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Movie Title", y = "Rating", title = "Top 10 Movies on IMDb")

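The graph shows that “The Shawshank Redemption” and “The Godfather” share the highest rating (9.2) among the top 10 movies, while “Pulp Fiction”, “The Lord of the Rings: The Fellowship of the Ring”, and “The Good, the Bad and the Ugly” share the lowest (8.8).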

Web Scraping

Godwin Nutsugah

2023-04-23
