Introduction

For my final project I decided to create a web scraper that goes through some Yelp search pages for 4 different cities: Los Angeles, New York City, Riverside, and San Francisco. I understand that Yelp has a powerful API, and it would probably support a better analysis, but I used an API for my DATA 606 project and I wanted to mix it up and challenge myself a bit. With the data I want to answer a few questions: Which city has the highest rated food? Which city has the most reviews? Is price correlated with rating?

Data Acquisition
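
The code throughout the post assumes these packages are loaded (rvest for the scraping, dplyr and ggplot2 for the wrangling and plots):

library(rvest)    # read_html, html_nodes, html_text, html_attr
library(dplyr)    # %>%, group_by, summarise, filter
library(ggplot2)  # plots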

Here is the setup for the web scraper. I select one link for each of the 4 cities. Notice that each URL ends with the query parameter "start=". This is how I will paginate through the results, with start beginning at 0 and incrementing by 10 until I reach 230 (24 pages per city).

link <- c("https://www.yelp.com/search?find_desc=Restaurants&find_loc=Los%20Angeles%2C%20CA&start=",
          "https://www.yelp.com/search?find_desc=&find_loc=San%20Francisco%2C%20CA&start=", 
          "https://www.yelp.com/search?find_desc=&find_near=new-york-city-new-york-14&start=",
          "https://www.yelp.com/search?cflt=restaurants&find_loc=Riverside%2C%20CA&start=")

city <- c("Los Angeles, CA", "San Francisco, CA", "New York City, NY", "Riverside, CA")
 
df <- data.frame(matrix(ncol=6,nrow=0))

colnames(df) <- c('name', 'price', 'number_of_reviews', 'rating', 'area', 'city') 
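
As a quick sanity check of the pagination scheme (just an illustration, not part of the scraper itself), the full set of URLs for the first city can be generated like this:

# start= takes the values 0, 10, 20, ..., 230, i.e. 24 pages per city
paste0(link[1], seq(0, 230, by = 10))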

This is a nested for loop: the outer loop runs once for each of the 4 links, and the inner loop works through all the pages for that link. I'm using a library called rvest to scrape each value I want. Each chunk of code pulls a vector of values, and the vectors are combined into a data frame that is then appended to the larger data frame.

for (i in seq_along(link)) {
  
  page <- 0
  
  for (x in 1:24) {
    
    # Build the URL for this page and parse the HTML
    content <- read_html(paste(link[i], as.character(page), sep=""))
    page <- page + 10
    
    # Restaurant names
    name <- content %>%
      html_nodes(".css-1pxmz4g") %>%
      html_text()
    
    # Price level and cuisine live in the same node (e.g. something like "$$Pizza")
    price_and_genre <- content %>%
      html_nodes(".css-1j7sdmt") %>%
      html_text()
    
    # Keep only the dollar signs for price, and only the remaining text for cuisine
    price <- gsub("[^\\$]+","", price_and_genre)
    food_type <- gsub("[\\$]+","", price_and_genre)
    
    # Review counts
    number_of_reviews <- content %>%
      html_nodes(".reviewCount__09f24__EUXPN") %>%
      html_text()
    
    # Star rating is stored in the aria-label attribute (e.g. "4.5 star rating")
    rating <- content %>%
      html_nodes(".overflow--hidden__09f24__3z7CX") %>%
      html_attr("aria-label") 
    
    rating <- rating[!is.na(rating)]
    
    # Neighborhood / area of the restaurant
    area <- content %>%
      html_nodes(".text-align--right__09f24__2OpQD") %>%
      html_text() 

    # Pause between requests to avoid hammering the server
    Sys.sleep(2)
    
    # Combine this page's vectors into a data frame and append it to the main one
    temp_df <- data.frame(name, price, number_of_reviews, rating, area, stringsAsFactors = FALSE)
    temp_df$city <- city[i]
    
    df <- rbind(df, temp_df)
  }
}
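
One thing to be aware of (not handled in the loop above): data.frame() will error out if the scraped vectors end up with different lengths, which can happen when a selector misses an element on a page. A defensive variant could truncate everything to the shortest vector first, something like:

# Hypothetical guard: keep only as many rows as the shortest scraped vector
n <- min(length(name), length(price), length(number_of_reviews),
         length(rating), length(area))
temp_df <- data.frame(name = name[1:n], price = price[1:n],
                      number_of_reviews = number_of_reviews[1:n],
                      rating = rating[1:n], area = area[1:n],
                      stringsAsFactors = FALSE)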

Data Exploration

Here we clean up the data set slightly by removing " star rating" from the rating string and converting that column to a double. We also convert the review count to an integer.

df$rating <- as.double(gsub(" star rating","",df$rating))
df$number_of_reviews <- as.integer(df$number_of_reviews)

Now let's look at some stats by city. I want to know which city has the highest average review rating and which has the most total reviews. Out of the 4 cities, Los Angeles has the highest average rating and Riverside has the lowest.

city_stats <- df %>%
  group_by(city) %>%
  summarise(
    average_rating = mean(rating),
    total_reviews = sum(number_of_reviews))

head(city_stats)
## # A tibble: 4 x 3
##   city              average_rating total_reviews
##   <chr>                      <dbl>         <int>
## 1 Los Angeles, CA             4.30        262726
## 2 New York City, NY           4.11         74382
## 3 Riverside, CA               4.02         84323
## 4 San Francisco, CA           4.24        520749

Interestingly, San Francisco has the highest number of reviews by far. My hypothesis is that, because Yelp is a San Francisco company, the city may have adopted it earlier than the others.

ggplot(data=city_stats, aes(x=reorder(city, total_reviews),y=total_reviews)) + geom_bar(stat="identity") + coord_flip()

Data Analysis

First we convert the price string into an integer from 1 to 4; restaurants with no price indicator become NA.

df$price_int <- as.integer(ifelse(df$price == "$$$$", 4,
       ifelse(df$price == "$$$", 3,
       ifelse(df$price == "$$", 2,
       ifelse(df$price == "$", 1, NA)
       ))))
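
Since the price column is just the run of dollar signs extracted earlier, an equivalent and more compact mapping (an alternative sketch, not what the code above does) would be:

# nchar("$$") = 2, nchar("") = 0 (no price indicator -> NA)
df$price_int <- ifelse(nchar(df$price) > 0, nchar(df$price), NA_integer_)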

df %>%
  group_by(price) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=reorder(price,count),y=count)) + geom_bar(stat="identity") + coord_flip()

Some restaurants don't have a price indicator, so we filter those out. We want to plot rating vs. price. My hypothesis is that the pricier the restaurant, the higher its rating will be on average. In the plot we use geom_jitter() to spread out overlapping points.

df2 <- df %>%
  filter(price_int >= 1) 

ggplot(data=df2, aes(x=price_int, y=rating)) + 
  geom_point() + 
  geom_jitter() + 
  geom_smooth(method = "lm", formula = y ~ x)

Looking at the model, the relationship is actually slightly negative. But the Pr(>|t|) value shows that this result is not statistically significant, so we can't really draw any conclusions from this model.

model <- lm(rating ~ price_int, data = df2)
summary(model)
## 
## Call:
## lm(formula = rating ~ price_int, data = df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.646 -0.122 -0.122  0.378  0.878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.17066    0.04456  93.588   <2e-16 ***
## price_int   -0.02435    0.02217  -1.098    0.273    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3812 on 825 degrees of freedom
## Multiple R-squared:  0.001459,   Adjusted R-squared:  0.000249 
## F-statistic: 1.206 on 1 and 825 DF,  p-value: 0.2725
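
The same conclusion can be cross-checked with a simple correlation test; for a single predictor its p-value matches the regression slope's (about 0.27 here):

cor.test(df2$price_int, df2$rating)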
We can also plot rating against the number of reviews.

ggplot(data=df2, aes(x=number_of_reviews, y=rating)) + 
  geom_point() + 
  geom_jitter() + 
  geom_smooth(method = "lm", formula = y ~ x)

Conclusion

In the future I would use the API to collect more data from more places in order to build a stronger model of the relationship between price and rating. That said, I did enjoy building the web-scraping script, and we did learn a few things along the way, like the fact that San Francisco has by far the most reviews.
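
As a rough sketch of what pulling the data through Yelp's Fusion API might look like (the /businesses/search endpoint and its parameters are documented by Yelp; the API key and helper name here are placeholders):

library(httr)

# Hypothetical helper: fetch one page (up to 50 results) of restaurant listings
yelp_search <- function(location, api_key, offset = 0) {
  resp <- GET(
    "https://api.yelp.com/v3/businesses/search",
    add_headers(Authorization = paste("Bearer", api_key)),
    query = list(term = "Restaurants", location = location,
                 limit = 50, offset = offset)
  )
  content(resp, as = "parsed")$businesses
}

# e.g. yelp_search("Riverside, CA", api_key = "YOUR_API_KEY")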