Yelp POIs Data Analysis - Restaurant and Shopping Categories

CP8883 Intro to Urban Analytics Fall 2024 - Mini Assignment 2

Thanawit Suwannikom

2024-09-26

Introduction

This R Markdown document analyzes Yelp POI data for the shopping and restaurant categories in Valdosta, GA.

The document is divided into three parts:

  1. Data Cleaning
  2. Data Analysis - Yelp POIs
  3. Data Analysis - POIs and Census Data

Load packages

library(tidycensus)
library(sf)
library(tmap)
library(jsonlite)
library(tidyverse)
library(httr)
library(reshape2)
library(here)
library(yelpr)
library(knitr)
library(skimr)

Load Saved POIs Data from RDS files

restaurant_all_list <- readRDS("restaurant_yelp_valdosta_ga.rds")
shopping_all_list <- readRDS("shopping_yelp_valdosta_ga.rds")

Convert lists to Dataframes and Combine into one Dataframe

restaurant_poi <- restaurant_all_list %>% 
                  bind_rows() %>%
                  mutate(main_category = "restaurant") #create a new column to specify main_category

shopping_poi <- shopping_all_list %>% 
                bind_rows() %>%
                mutate(main_category = "shopping")

all_poi <- bind_rows(restaurant_poi, shopping_poi) %>% as_tibble()

1. Data Cleaning

See the classes of each column

sapply(all_poi, class) %>% print()
##             id          alias           name      image_url      is_closed 
##    "character"    "character"    "character"    "character"      "logical" 
##            url   review_count     categories         rating    coordinates 
##    "character"      "integer"         "list"      "numeric"   "data.frame" 
##   transactions          price       location          phone  display_phone 
##         "list"    "character"   "data.frame"    "character"    "character" 
##       distance business_hours     attributes  main_category 
##      "numeric"         "list"   "data.frame"    "character"

Remove duplicates

all_poi_unique <- all_poi %>%
                  distinct(id, .keep_all = TRUE)
print(paste0("Before dropping duplicated id: ", nrow(all_poi)))
## [1] "Before dropping duplicated id: 6034"
print(paste0("After dropping duplicated id: ", nrow(all_poi_unique)))
## [1] "After dropping duplicated id: 1010"
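
One side effect of `distinct(id, .keep_all = TRUE)` is worth keeping in mind: it keeps the first row seen for each `id`, so a business returned by both searches would keep the `main_category` of whichever list was bound first. A minimal sketch with made-up rows (the ids, names, and the `toy_*` objects are hypothetical, not taken from the Yelp data):

```r
library(dplyr)

# Hypothetical toy rows: "b2" is returned by both the restaurant and shopping searches
toy_poi <- tibble(
  id            = c("b1", "b2", "b2", "b3"),
  name          = c("Cafe A", "Market B", "Market B", "Shop C"),
  main_category = c("restaurant", "restaurant", "shopping", "shopping")
)

# distinct() keeps the first occurrence of each id along with all other columns
toy_unique <- toy_poi %>% distinct(id, .keep_all = TRUE)

# Ids that appeared more than once before de-duplication
dup_ids <- toy_poi %>% count(id) %>% filter(n > 1) %>% pull(id)

print(toy_unique)
print(dup_ids)
```

Listing the duplicated ids before dropping them makes it possible to check how many businesses sit in both main categories.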

Drop records without coordinate information

all_poi_nona <- all_poi_unique %>% 
  filter(!is.na(coordinates$longitude))
print(paste0("Before dropping na: ", nrow(all_poi_unique)))
## [1] "Before dropping na: 1010"
print(paste0("After dropping na: ", nrow(all_poi_nona)))
## [1] "After dropping na: 1010"

No records are missing coordinate information.

Unnest category columns

# since one poi can have more than 1 category, when unnest it will result in a long-format table
df_poi_cat_long <- all_poi_nona %>%
  unnest(categories, names_sep="_")  # unnest categories and separate sub-columns with "_"

print(paste0("Before unnesting categories: ", nrow(all_poi_nona)))
## [1] "Before unnesting categories: 1010"
print(paste0("After unnesting categories: ", nrow(df_poi_cat_long)))
## [1] "After unnesting categories: 1869"
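
The row count grows because each business contributes one row per category. A small sketch of the same `unnest()` call on made-up data (the `toy` objects and category values are hypothetical; the list-column of small data frames only imitates the shape of the Yelp response):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `categories` is a list-column of small data frames
toy <- tibble(
  id = c("b1", "b2"),
  categories = list(
    data.frame(alias = c("pizza", "italian"), title = c("Pizza", "Italian")),
    data.frame(alias = "shoes", title = "Shoe Stores")
  )
)

# One output row per (business, category) pair;
# the sub-columns are prefixed with "categories_"
toy_long <- toy %>% unnest(categories, names_sep = "_")

print(toy_long)
```

Because the long table repeats each business, counts of businesses should use `n_distinct(id)` rather than `n()`.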

Extract Coordinates

poi_sf <- all_poi_nona %>% 
  mutate(x = .$coordinates$longitude,
         y = .$coordinates$latitude) %>% 
  filter(!is.na(x) & !is.na(y)) %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326)
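
The `coordinates` column is itself a data frame nested inside the table, which is why it is accessed with `$`. The sketch below repeats the step on a made-up two-row table (the `toy_*` names and coordinate values are hypothetical). It assumes the column can be referenced directly inside `mutate()` without the `.` placeholder.

```r
library(dplyr)
library(sf)

# Hypothetical toy table with a nested data-frame column, as in the Yelp response
toy_poi <- tibble(
  id = c("b1", "b2"),
  coordinates = data.frame(latitude  = c(30.83, NA),
                           longitude = c(-83.28, NA))
)

toy_sf <- toy_poi %>%
  # Pull longitude/latitude out of the nested column
  mutate(x = coordinates$longitude,
         y = coordinates$latitude) %>%
  # st_as_sf() fails on missing coordinates by default, so drop them first
  filter(!is.na(x), !is.na(y)) %>%
  st_as_sf(coords = c("x", "y"), crs = 4326)

print(nrow(toy_sf))
```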

Load City Polygon

city <- "Valdosta"
state <- "GA"
county <- "Lowndes"

city_polygon <- tigris::places(state) %>% 
            filter(NAME == city)
## Retrieving data for the year 2022
# Convert CRS of city polygon to EPSG:4326
city_polygon <- city_polygon %>% st_transform(crs=4326)

Remove out of city POIs

city_poi_sf <- poi_sf[city_polygon, ]

print(paste0("All Business POIs: ", nrow(poi_sf)))
## [1] "All Business POIs: 1010"
print(paste0("Business POIs in the City Boundary: ", nrow(city_poi_sf)))
## [1] "Business POIs in the City Boundary: 708"

Only 708 of the 1,010 POIs fall within the city boundary.
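
The bracket subsetting `poi_sf[city_polygon, ]` keeps the points that intersect the polygon. A minimal sketch on toy geometry (a made-up square and three made-up points, not the real Valdosta boundary):

```r
library(sf)

# Hypothetical toy boundary: a 1 x 1 degree square
square <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
toy_boundary <- st_sf(name = "toy city", geometry = st_sfc(square, crs = 4326))

# Hypothetical toy points: two inside the square, one outside
toy_points <- st_as_sf(
  data.frame(id = c("in1", "in2", "out1"),
             x  = c(0.2, 0.8, 1.5),
             y  = c(0.5, 0.1, 0.5)),
  coords = c("x", "y"), crs = 4326
)

# x[y, ] keeps the features of x that intersect y (st_intersects is the default)
toy_inside <- toy_points[toy_boundary, ]

# Equivalent, more explicit form
toy_inside2 <- st_filter(toy_points, toy_boundary)

print(as.character(toy_inside$id))
```

Both layers must share the same CRS, which is why the city polygon is transformed to EPSG:4326 before subsetting.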

Remove POIs outside the city boundary from the long-format table

print(paste0("Number of records before removing out of city POIs: ", nrow(df_poi_cat_long)))
## [1] "Number of records before removing out of city POIs: 1869"
df_poi_cat_long <- df_poi_cat_long %>%
  semi_join(city_poi_sf, by="id")

print(paste0("Number of records after removing out of city POIs: ", nrow(df_poi_cat_long)))
## [1] "Number of records after removing out of city POIs: 1329"
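
`semi_join()` acts as a filter here: it keeps the rows of the long table whose `id` appears in the in-city set and adds no columns. A small sketch on made-up tables (all `toy_*` names and values are hypothetical):

```r
library(dplyr)

# Hypothetical long-format table: business "b1" has two categories
toy_long <- tibble(
  id               = c("b1", "b1", "b2", "b3"),
  categories_title = c("Pizza", "Italian", "Shoe Stores", "Bakeries")
)

# Hypothetical set of businesses that fall inside the city boundary
toy_in_city <- tibble(id = c("b1", "b3"))

# Keep only rows whose id is in toy_in_city; columns of toy_long are unchanged
toy_long_city <- toy_long %>% semi_join(toy_in_city, by = "id")

print(toy_long_city)
```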

2. Data Analysis - Yelp POIs Data

Visualize POIs distribution on Maps

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(city_poi_sf) +
  tm_dots(col = "main_category", size = "rating", palette="Set2",
          alpha=0.7, scale=1, size.max=50, id="name") +
  tm_shape(city_polygon) + tm_borders()
## Legend for symbol sizes not available in view mode.

The interactive map shows the locations of business POIs: green bubbles represent restaurants and orange bubbles represent shopping businesses. Bubble size indicates the rating the business received from Yelp users. Hovering the mouse over a point shows the name of the business.

From the map above, two main clusters of business POIs are visible. The first runs along Highway 41, which crosses the middle of the city. The second, on the left side of the map, is dense around the Valdosta Mall and close to I-75.

Number of POIs by Category

# Prepare data for the bar charts
df_poi_cnt_by_cat <- df_poi_cat_long %>%
  group_by(main_category, categories_title) %>%
  summarise(num_pois = n_distinct(id)) %>%
  ungroup()
## `summarise()` has grouped output by 'main_category'. You can override using the
## `.groups` argument.
df_restaurant_cnt <- df_poi_cnt_by_cat %>% filter(main_category == "restaurant")
df_shopping_cnt <- df_poi_cnt_by_cat %>% filter(main_category == "shopping")

Some stats about the number of POIs by category

skim(df_poi_cnt_by_cat)
Data summary

Name                     df_poi_cnt_by_cat
Number of rows           255
Number of columns        3
Column type frequency:
  character              2
  numeric                1
Group variables          None

Variable type: character

skim_variable     n_missing  complete_rate  min  max  empty  n_unique  whitespace
main_category             0              1    8   10      0         2           0
categories_title          0              1    4   32      0       248           0

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd  p0  p25  p50  p75  p100  hist
num_pois               0              1  5.21  8.01   1    1    2    6    75  ▇▁▁▁▁

ggplot(df_restaurant_cnt) +
  geom_histogram(aes(x = num_pois), bins = 20) + 
  labs(title = "Histogram POIs in Restaurant Category", x = "Number of POIs", y = "Count") +
  theme_minimal()

ggplot(df_shopping_cnt) +
  geom_histogram(aes(x = num_pois), bins = 20) + 
  labs(title = "Histogram POIs in Shopping Category", x = "Number of POIs", y = "Count") +
  theme_minimal()

The summary statistics and histograms show that many categories contain only a few POIs. Let's take a look at the top 10 categories for each business type.

Restaurants

ggplot(df_restaurant_cnt %>% 
         arrange(desc(num_pois)) %>% 
         slice(1:10), 
       aes(y = reorder(categories_title, num_pois), 
           x = num_pois)) +
  geom_bar(stat = "identity") + 
  labs(title = "Number of POIs in Restaurant Category", x = "Number of POIs", y = "Categories") +
  theme_minimal()

The top 10 restaurant categories are shown above. A Parks category appears among them, possibly because some restaurants provide parking lots and Yelp records this as a category; this explanation has not been verified. In further analysis, this category should be removed to avoid misleading results.
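
Dropping an unwanted category before ranking could look like the sketch below. The counts are made up, and the assumption that the label is exactly "Parks" should be checked against the actual `categories_title` values; `excluded_titles` and the `toy_*` names are hypothetical.

```r
library(dplyr)

# Hypothetical toy counts; "Parks" stands in for the unexpected category label
toy_restaurant_cnt <- tibble(
  main_category    = "restaurant",
  categories_title = c("Fast Food", "Burgers", "Parks", "Pizza"),
  num_pois         = c(12L, 9L, 7L, 5L)
)

# Assumption: the label to exclude is exactly "Parks"
excluded_titles <- c("Parks")

toy_restaurant_clean <- toy_restaurant_cnt %>%
  filter(!categories_title %in% excluded_titles) %>%
  arrange(desc(num_pois))

print(toy_restaurant_clean)
```

Keeping the excluded labels in a vector makes it easy to add further categories later without changing the pipeline.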

Shopping

ggplot(df_shopping_cnt %>% 
         arrange(desc(num_pois)) %>% 
         slice(1:10), 
       aes(y = reorder(categories_title, num_pois), 
           x = num_pois)) +
  geom_bar(stat = "identity") + 
  labs(title = "Number of POIs in Shopping Category", x = "Number of POIs", y = "Categories") +
  theme_minimal()

For the shopping POIs, the largest category is women’s clothing, with more than 40 of the 398 shopping POIs.

Ratings and Price

Let’s take a look at the rating and price of the POIs.

Distribution of rating by each business category

city_poi_restaurant_sf <- city_poi_sf %>% filter(main_category == "restaurant")
city_poi_shopping_sf <- city_poi_sf %>% filter(main_category == "shopping")

# Restaurant
ggplot(city_poi_restaurant_sf) +
  geom_histogram(aes(x = rating), bins = 20) +
  labs(title = "Rating Distribution of Restaurant Business POIs") +
  theme_minimal()

# Shopping
ggplot(city_poi_shopping_sf) +
  geom_histogram(aes(x = rating), bins = 20) +
  labs(title = "Rating Distribution of Shopping Business POIs") +
  theme_minimal()

Distribution of rating by price

# Restaurant
ggplot(city_poi_restaurant_sf %>% filter(!is.na(price))) +
  geom_density(aes(x = rating, fill = price), alpha = 0.3) +
  labs(title = "Rating Distribution of Restaurant Business POIs by Price") +
  theme_minimal()
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

# Shopping
ggplot(city_poi_shopping_sf %>% filter(!is.na(price))) +
  geom_density(aes(x = rating, fill = price), alpha = 0.3) +
  labs(title = "Rating Distribution of Shopping Business POIs by Price") +
  theme_minimal()

For the restaurant POIs, the $$$ tier has fewer than two data points, so its density is not drawn. For $ restaurants, the ratings are concentrated in the lower range of the scale, peaking between 1.5 and 2.5, which indicates that lower-priced restaurants tend to receive lower ratings overall. For the middle tier ($$), the density peaks between 2.5 and 3, suggesting that moderately priced restaurants tend to be rated higher than the lowest-priced ones.

For the shopping POIs, the pattern is similar to the restaurant category: businesses in the higher price tier ($$$) tend to receive more favorable ratings, suggesting an association between price and perceived quality. Moderately priced shopping businesses also tend to receive favorable ratings, but their distribution is slightly more spread out than that of the highest tier.
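
Density curves are easier to interpret next to the group sizes behind them. The sketch below tabulates count, median, and mean rating per price tier on made-up ratings (the `toy_*` objects and values are hypothetical, not the Valdosta data).

```r
library(dplyr)

# Hypothetical toy ratings
toy_ratings <- tibble(
  price  = c("$", "$", "$", "$$", "$$", "$$", "$$$"),
  rating = c(2.0, 2.5, 1.5, 3.0, 2.5, 3.5, 4.0)
)

# Count, median, and mean rating per price tier.
# A tier with n < 2 cannot support a density estimate, so ggplot2 drops it.
toy_rating_by_price <- toy_ratings %>%
  group_by(price) %>%
  summarise(n             = n(),
            median_rating = median(rating),
            mean_rating   = mean(rating),
            .groups = "drop")

print(toy_rating_by_price)
```

Tiers with only a handful of businesses should be read with caution, since a single rating can shift the whole curve.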

3. Data Analysis - POIs and Census Tract Data

Take a look at Correlation to Household Income and Population

Load Household Income and Population at Census Tract Level

tract_data <- suppressMessages(
  get_acs(geography = 'tract',
          state = state,
          county = county,
          variables = c(hhincome = "B19013_001",
                        pop = "B01003_001"),
          year = 2022,
          survey = 'acs5', 
          geometry = TRUE,
          output = 'wide'))
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.

Convert Tract CRS to EPSG:4326

epsg_id <- 4326
tract_data <- tract_data %>% 
  st_transform(crs = epsg_id)

Join POIs to census tracts

tract_pois <- st_join(tract_data, city_poi_sf, join=st_intersects)
print(paste0("Number of all POIs: ", nrow(city_poi_sf)))
## [1] "Number of all POIs: 708"
print(paste0("Number of records after spatial join: ", nrow(tract_pois)))
## [1] "Number of records after spatial join: 717"

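The join returns 717 rows rather than 708 because some tracts do not contain any POIs: the left join keeps each of those tracts as one row with missing POI fields, which accounts for the 9 extra rows.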

Remove tracts that don’t have POIs

tract_pois_clean <- tract_pois %>%
  filter(!is.na(id))

print(paste0("Number of records after cleaning: ", nrow(tract_pois_clean)))
## [1] "Number of records after cleaning: 708"
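
The row counts follow from how `st_join()` works: it is a left join by default, so every tract is kept, and a tract without POIs appears once with `NA` in the POI columns. A sketch on toy geometry (three made-up square tracts and three made-up points, with no CRS; `make_square` and the `toy_*` names are hypothetical):

```r
library(sf)
library(dplyr)

# Helper: unit square whose lower-left corner is (x0, 0)
make_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Three hypothetical tracts side by side
toy_tracts <- st_sf(
  GEOID    = c("T1", "T2", "T3"),
  geometry = st_sfc(make_square(0), make_square(1), make_square(2))
)

# Three hypothetical POIs: two in T1, one in T3, none in T2
toy_pois <- st_as_sf(
  data.frame(id = c("p1", "p2", "p3"),
             x  = c(0.2, 0.7, 2.5),
             y  = c(0.5, 0.5, 0.5)),
  coords = c("x", "y")
)

# Left join: T2 is kept as a single row with id = NA
toy_joined <- st_join(toy_tracts, toy_pois, join = st_intersects)

n_empty <- sum(is.na(toy_joined$id))   # number of tracts without any POI
toy_joined_clean <- toy_joined %>% filter(!is.na(id))

print(nrow(toy_joined))
print(n_empty)
```

Passing `left = FALSE` to `st_join()` would perform an inner join and skip the separate cleaning step.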

Aggregate rating and count number of POIs group by business category and tract

tract_pois_clean_agg <- tract_pois_clean %>%
  st_drop_geometry() %>%
  group_by(GEOID, main_category, hhincomeE, popE) %>%
  summarise(num_pois = n_distinct(id), med_rating = median(rating), avg_rating = mean(rating)) %>%
  ungroup()
## `summarise()` has grouped output by 'GEOID', 'main_category', 'hhincomeE'. You
## can override using the `.groups` argument.

Plot Scatter Plots between Household Income, Population and rating and number of POIs

ggplot(tract_pois_clean_agg, aes(x=hhincomeE, y=avg_rating, color=main_category)) +
  geom_point() +
  labs(title="Household Income vs Avg Rating")+
  theme_minimal()

ggplot(tract_pois_clean_agg, aes(x=hhincomeE, y=num_pois, color=main_category)) +
  geom_point() +
  labs(title="Household Income vs Number of POIs")+
  theme_minimal()

ggplot(tract_pois_clean_agg, aes(x=popE, y=avg_rating, color=main_category)) +
  geom_point() +
  labs(title="Population vs Avg Rating")+
  theme_minimal()

ggplot(tract_pois_clean_agg, aes(x=popE, y=num_pois, color=main_category)) +
  geom_point() +
  labs(title="Population vs Number of POIs")+
  theme_minimal()

The four scatter plots above show no clear association between household income and either average rating or number of POIs, nor between population and either variable. This may be because the sample size is small (only 19 census tracts) or because the selected POI categories are not the relevant ones; other POI categories may show a relationship with these variables.
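
A rank correlation can put a number on what the scatter plots show. The sketch below uses base R's `cor.test()` with the Spearman method on made-up tract values (the `toy_tracts` data are hypothetical, not the Valdosta results); Spearman is chosen here as an assumption because POI counts are skewed.

```r
# Hypothetical tract-level values
toy_tracts <- data.frame(
  hhincomeE = c(32000, 41000, 45000, 52000, 58000, 63000, 70000, 88000),
  num_pois  = c(40, 5, 62, 12, 30, 8, 55, 20)
)

# Spearman rank correlation; exact = FALSE requests the asymptotic p-value,
# which also avoids warnings when ties are present
res <- cor.test(toy_tracts$hhincomeE, toy_tracts$num_pois,
                method = "spearman", exact = FALSE)

rho     <- unname(res$estimate)
p_value <- res$p.value

print(rho)
print(p_value)
```

With only a few tracts per category, a non-significant result does not rule out a relationship; it only means the data cannot detect one.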

Conclusion

In this R document, Yelp POI data for the restaurant and shopping categories in Valdosta, GA are analyzed. The spatial distribution is visualized in an interactive map, which shows two major clusters of POIs located along the main roads.

The top 10 restaurant and shopping categories are shown. The largest restaurant category by number of POIs is fast food, and the largest shopping category is women’s clothing.

Rating and price are also analyzed. The rating distributions differ noticeably between the low ($), medium ($$), and high ($$$) price tiers: the low tier is spread more evenly, while the higher tiers show distinct peaks.

Lastly, the relationships between household income, population, average rating, and number of POIs are considered. No clear trend was observed among them.