Introduction
This R Markdown document demonstrates a data analysis of Yelp POI data, focusing on the shopping and restaurant categories in Valdosta, GA.
The document is divided into three parts:
- Data Cleaning
- Data Analysis - Yelp POIs
- Data Analysis - POIs and Census Data
Load packages
library(tidycensus)
library(sf)
library(tmap)
library(jsonlite)
library(tidyverse)
library(httr)
library(reshape2)
library(here)
library(yelpr)
library(knitr)
library(skimr)
Load Saved POIs Data from RDS files
restaurant_all_list <- readRDS("restaurant_yelp_valdosta_ga.rds")
shopping_all_list <- readRDS("shopping_yelp_valdosta_ga.rds")
Convert lists to Dataframes and Combine into one Dataframe
restaurant_poi <- restaurant_all_list %>%
bind_rows() %>%
mutate(main_category = "restaurant") #create a new column to specify main_category
shopping_poi <- shopping_all_list %>%
bind_rows() %>%
mutate(main_category = "shopping")
all_poi <- bind_rows(restaurant_poi, shopping_poi) %>% as_tibble()
1. Data Cleaning
See the classes of each column
sapply(all_poi, class) %>% print()
## id alias name image_url is_closed
## "character" "character" "character" "character" "logical"
## url review_count categories rating coordinates
## "character" "integer" "list" "numeric" "data.frame"
## transactions price location phone display_phone
## "list" "character" "data.frame" "character" "character"
## distance business_hours attributes main_category
## "numeric" "list" "data.frame" "character"
Remove duplicates
all_poi_unique <- all_poi %>%
distinct(id, .keep_all = TRUE)
print(paste0("Before dropping duplicated id: ", nrow(all_poi)))
## [1] "Before dropping duplicated id: 6034"
print(paste0("After dropping duplicated id: ", nrow(all_poi_unique)))
## [1] "After dropping duplicated id: 1010"
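As a minimal sketch of what `distinct(id, .keep_all = TRUE)` does, the toy tibble below contains one repeated id. The ids, names, and ratings are made up for illustration and are not Yelp data. Only the first row for each id is kept, together with all other columns.

```r
library(dplyr)

# Hypothetical toy data: the id "a1" appears twice
toy_poi <- tibble(
  id     = c("a1", "b2", "a1", "c3"),
  name   = c("Cafe A", "Shop B", "Cafe A", "Diner C"),
  rating = c(4.0, 3.5, 4.0, 2.5)
)

# Keep the first row for each id and retain every other column
toy_unique <- toy_poi %>%
  distinct(id, .keep_all = TRUE)

nrow(toy_poi)     # 4 rows before
nrow(toy_unique)  # 3 rows after
```

Without `.keep_all = TRUE`, `distinct(id)` would return only the `id` column.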
Drop records without coordinate information
all_poi_nona <- all_poi_unique %>%
filter(!is.na(coordinates$longitude))
print(paste0("Before dropping na: ", nrow(all_poi_unique)))
## [1] "Before dropping na: 1010"
print(paste0("After dropping na: ", nrow(all_poi_nona)))
## [1] "After dropping na: 1010"
No records are missing coordinate information.
Unnest category columns
# A POI can have more than one category, so unnesting produces a long-format table (one row per POI-category pair)
df_poi_cat_long <- all_poi_nona %>%
unnest(categories, names_sep="_") # unnest categories and separate sub-columns with "_"
print(paste0("Before unnesting categories: ", nrow(all_poi_nona)))
## [1] "Before unnesting categories: 1010"
print(paste0("After unnesting categories: ", nrow(df_poi_cat_long)))
## [1] "After unnesting categories: 1869"
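The row count grows because each POI contributes one row per category. The sketch below illustrates this on a hypothetical two-row tibble whose `categories` list-column holds small data frames with `alias` and `title` fields, which is an assumption about the shape of the Yelp response based on the `categories_title` column used later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the first POI has two categories, the second has one
toy_poi <- tibble(
  id = c("a1", "b2"),
  categories = list(
    data.frame(alias = c("burgers", "hotdogs"), title = c("Burgers", "Hot Dogs")),
    data.frame(alias = "shoes", title = "Shoe Stores")
  )
)

# names_sep = "_" prefixes the inner column names with "categories_"
toy_long <- toy_poi %>%
  unnest(categories, names_sep = "_")

names(toy_long)  # "id" "categories_alias" "categories_title"
nrow(toy_long)   # 3 rows: one per POI-category pair
```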
Extract Coordinates
poi_sf <- all_poi_nona %>%
mutate(x = .$coordinates$longitude,
y = .$coordinates$latitude) %>%
filter(!is.na(x) & !is.na(y)) %>%
st_as_sf(coords = c("x", "y"), crs = 4326)
Load City Polygon
city <- "Valdosta"
state <- "GA"
county <- "Lowndes"
city_polygon <- tigris::places(state) %>%
filter(NAME == city)
## Retrieving data for the year 2022
# Convert CRS of city polygon to EPSG:4326
city_polygon <- city_polygon %>% st_transform(crs=4326)
Remove out of city POIs
city_poi_sf <- poi_sf[city_polygon, ]
print(paste0("All Business POIs: ", nrow(poi_sf)))
## [1] "All Business POIs: 1010"
print(paste0("Business POIs in the City Boundary: ", nrow(city_poi_sf)))
## [1] "Business POIs in the City Boundary: 708"
Only 708 of the 1,010 POIs fall within the city boundary.
Remove the out-of-city POIs from the long-format table
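The expression `poi_sf[city_polygon, ]` is a spatial subset: it keeps the features of the first object that intersect the second. A minimal sketch with a hypothetical square boundary and three made-up points (none of these coordinates relate to Valdosta):

```r
library(sf)

# Hypothetical square boundary in EPSG:4326
boundary <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  crs = 4326
)

# Two points inside the square and one outside
pts <- st_as_sf(
  data.frame(id = c("in1", "in2", "out1"),
             x  = c(0.2, 0.8, 1.5),
             y  = c(0.5, 0.3, 0.5)),
  coords = c("x", "y"), crs = 4326
)

# x[y, ] keeps the features of x that intersect y
pts_inside <- pts[boundary, ]
pts_inside$id
```

Both objects must share the same CRS, which is why the city polygon is transformed to EPSG:4326 before the subset.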
print(paste0("Number of records before removing out of city POIs: ", nrow(df_poi_cat_long)))
## [1] "Number of records before removing out of city POIs: 1869"
df_poi_cat_long <- df_poi_cat_long %>%
semi_join(city_poi_sf, by="id")
print(paste0("Number of records after removing out of city POIs: ", nrow(df_poi_cat_long)))
## [1] "Number of records after removing out of city POIs: 1329"
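`semi_join()` acts as a filter here: it keeps the rows of the long table whose `id` appears in the in-city table and adds no columns from it. A minimal sketch on hypothetical ids:

```r
library(dplyr)

# Hypothetical long-format table: POI "a1" has two categories
long_tbl <- tibble(
  id               = c("a1", "a1", "b2", "c3"),
  categories_title = c("Burgers", "Fast Food", "Shoe Stores", "Pizza")
)

# Hypothetical set of ids that fall inside the city boundary
in_city <- tibble(id = c("a1", "c3"))

# Keep only the rows whose id has a match in in_city
filtered <- long_tbl %>%
  semi_join(in_city, by = "id")

nrow(filtered)  # 3 rows: both "a1" rows and the "c3" row
```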
2. Data Analysis - Yelp POIs Data
Visualize POIs distribution on Maps
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(city_poi_sf) +
tm_dots(col = "main_category", size = "rating", palette="Set2",
alpha=0.7, scale=1, size.max=50, id="name") +
tm_shape(city_polygon) + tm_borders()
## Legend for symbol sizes not available in view mode.
The interactive map shows the locations of business POIs: green bubbles represent restaurants and orange bubbles represent shopping businesses. Bubble size indicates the rating the business received from Yelp users. Hovering over a point shows the name of the business.
The map shows two main clusters of business POIs. The first lies along Highway 41, which runs through the middle of the city. The second, on the left (west) side of the map, is densest near the Valdosta Mall and close to the I-75 highway.
Number of POIs by Category
# Prepare data for the bar charts
df_poi_cnt_by_cat <- df_poi_cat_long %>%
group_by(main_category, categories_title) %>%
summarise(num_pois = n_distinct(id)) %>%
ungroup()
## `summarise()` has grouped output by 'main_category'. You can override using the
## `.groups` argument.
df_restaurant_cnt <- df_poi_cnt_by_cat %>% filter(main_category == "restaurant")
df_shopping_cnt <- df_poi_cnt_by_cat %>% filter(main_category == "shopping")
Summary statistics for the number of POIs by category
skim(df_poi_cnt_by_cat)
Name | df_poi_cnt_by_cat |
Number of rows | 255 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
main_category | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
categories_title | 0 | 1 | 4 | 32 | 0 | 248 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
num_pois | 0 | 1 | 5.21 | 8.01 | 1 | 1 | 2 | 6 | 75 | ▇▁▁▁▁ |
ggplot(df_restaurant_cnt) +
geom_histogram(aes(x = num_pois), bins = 20) +
labs(title = "Histogram of POIs per Restaurant Category", x = "Number of POIs per category", y = "Number of categories") +
theme_minimal()
ggplot(df_shopping_cnt) +
geom_histogram(aes(x = num_pois), bins = 20) +
labs(title = "Histogram of POIs per Shopping Category", x = "Number of POIs per category", y = "Number of categories") +
theme_minimal()
The statistics and histograms show that many categories contain only a
few POIs. Let’s take a look at the top 10 categories for each business type.
Restaurants
ggplot(df_restaurant_cnt %>%
arrange(desc(num_pois)) %>%
slice(1:10),
aes(y = reorder(categories_title, num_pois),
x = num_pois)) +
geom_bar(stat = "identity") +
labs(title = "Number of POIs in Restaurant Category", x = "Number of POIs", y = "Categories") +
theme_minimal()
The chart above shows the top 10 restaurant categories. I notice a Parks category among them, possibly because some restaurants also provide parking lots for customers and Yelp records them under this category. Further analysis should remove this category to avoid misinterpretation.
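One way to drop such a category in a later analysis is to filter on `categories_title` before plotting. The sketch below uses a hypothetical count table and assumes the title string is exactly "Parks"; the real label should be checked in the data first.

```r
library(dplyr)

# Hypothetical category counts (made-up numbers)
toy_cnt <- tibble(
  main_category    = "restaurant",
  categories_title = c("Fast Food", "Parks", "Burgers"),
  num_pois         = c(10L, 4L, 7L)
)

# Assumed label(s) to exclude; extend this vector as needed
excluded_titles <- c("Parks")

toy_cnt_clean <- toy_cnt %>%
  filter(!categories_title %in% excluded_titles)

toy_cnt_clean$categories_title
```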
Shopping
ggplot(df_shopping_cnt %>%
arrange(desc(num_pois)) %>%
slice(1:10),
aes(y = reorder(categories_title, num_pois),
x = num_pois)) +
geom_bar(stat = "identity") +
labs(title = "Number of POIs in Shopping Category", x = "Number of POIs", y = "Categories") +
theme_minimal()
Among the shopping POIs, the top category is women’s clothing, which accounts for more than 40 of the 398 shopping POIs.
Ratings and Price
Let’s take a look at the rating and price of the POIs.
Distribution of rating by each business category
city_poi_restaurant_sf <- city_poi_sf %>% filter(main_category == "restaurant")
city_poi_shopping_sf <- city_poi_sf %>% filter(main_category == "shopping")
# Restaurant
ggplot(city_poi_restaurant_sf) +
geom_histogram(aes(x = rating), bins = 20) +
labs(title = "Rating Distribution of Restaurant Business POIs") +
theme_minimal()
# Shopping
ggplot(city_poi_shopping_sf) +
geom_histogram(aes(x = rating), bins = 20) +
labs(title = "Rating Distribution of Shopping Business POIs") +
theme_minimal()
Distribution of rating by price
# Restaurant
ggplot(city_poi_restaurant_sf %>% filter(!is.na(price))) +
geom_density(aes(x = rating, fill = price), alpha = 0.3) +
labs(title = "Rating Distribution of Restaurant Business POIs by Price") +
theme_minimal()
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
# Shopping
ggplot(city_poi_shopping_sf %>% filter(!is.na(price))) +
geom_density(aes(x = rating, fill = price), alpha = 0.3) +
labs(title = "Rating Distribution of Shopping Business POIs by Price") +
theme_minimal()
For the restaurant POIs, the `$$$` tier has fewer than two data points, so its density curve is not drawn. For `$` restaurants, the rating distribution is concentrated in the lower range of the rating scale, peaking between 1.5 and 2.5, which indicates that lower-priced restaurants tend to receive lower ratings overall. For the middle tier `$$`, the density peaks between ratings of 2.5 and 3, which suggests that moderately priced restaurants tend to receive higher ratings than the lowest-priced tier.
For the shopping POIs, as in the restaurant category, businesses in the highest price tier `$$$` tend to receive more favorable ratings, which suggests an association between price and perceived quality. Moderately priced shopping businesses also tend to receive favorable ratings, but their rating distribution is slightly more spread out than that of the highest price tier.
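Because `geom_density()` drops groups with fewer than two data points, it can help to count the POIs in each price tier before reading the density plots. The sketch below does this on a hypothetical table of ratings and prices; the values are made up and only the column names follow the Yelp data.

```r
library(dplyr)

# Hypothetical ratings by price tier (made-up values)
toy_rest <- tibble(
  rating = c(2.0, 1.5, 2.5, 3.0, 2.5, 4.0),
  price  = c("$", "$", "$", "$$", "$$", "$$$")
)

# Count and median rating per tier
tier_summary <- toy_rest %>%
  filter(!is.na(price)) %>%
  group_by(price) %>%
  summarise(n = n(), med_rating = median(rating), .groups = "drop")

# Tiers too small for a density estimate
sparse_tiers <- tier_summary %>%
  filter(n < 2) %>%
  pull(price)

tier_summary
sparse_tiers
```

A table like `tier_summary` also makes it easier to judge how much weight each curve deserves, since a tier with only a handful of points can produce a sharp but unreliable peak.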
3. Data Analysis - POIs and Census Tract Data
Correlation with Household Income and Population
Load Household Income and Population at Census Tract Level
tract_data <- suppressMessages(
get_acs(geography = 'tract',
state = state,
county = county,
variables = c(hhincome = "B19013_001",
pop = "B01003_001"),
year = 2022,
survey = 'acs5',
geometry = TRUE,
output = 'wide'))
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
Convert Tract CRS to EPSG:4326
epsg_id <- 4326
tract_data <- tract_data %>%
st_transform(crs = epsg_id)
Join POIs to census tracts
tract_pois <- st_join(tract_data, city_poi_sf, join=st_intersects)
print(paste0("Number of all POIs: ", nrow(city_poi_sf)))
## [1] "Number of all POIs: 708"
print(paste0("Number of records after spatial join: ", nrow(tract_pois)))
## [1] "Number of records after spatial join: 717"
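Some tracts do not contain any POIs. The left join keeps each of them as a single row with missing POI fields, which accounts for the 9 extra records (717 versus 708).
Remove tracts that don’t have POIs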
tract_pois_clean <- tract_pois %>%
filter(!is.na(id))
print(paste0("Number of records after cleaning: ", nrow(tract_pois_clean)))
## [1] "Number of records after cleaning: 708"
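The behavior of the join and the cleanup step can be reproduced on a small example. In the sketch below, two hypothetical square tracts and two made-up points stand in for the census tracts and POIs: `st_join()` performs a left join by default, so the tract without points is kept as one row with `NA` in the point columns.

```r
library(sf)
library(dplyr)

# Helper (hypothetical): unit square whose lower-left corner is at (x0, 0)
make_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# Two toy tracts; only T1 will contain points
tracts <- st_sf(
  GEOID    = c("T1", "T2"),
  geometry = st_sfc(make_square(0), make_square(2), crs = 4326)
)

pois <- st_as_sf(
  data.frame(id = c("p1", "p2"), x = c(0.2, 0.7), y = c(0.5, 0.5)),
  coords = c("x", "y"), crs = 4326
)

# Left join: one row per tract-POI match, plus one NA row per empty tract
joined <- st_join(tracts, pois, join = st_intersects)
nrow(joined)  # 3 rows: two matches for T1 and one NA row for T2

# Drop the rows that come from tracts without POIs
joined_clean <- joined %>%
  filter(!is.na(id))
nrow(joined_clean)  # 2 rows
```

Passing `left = FALSE` to `st_join()` would give an inner join and skip the cleanup step, at the cost of hiding how many tracts are empty.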
Aggregate ratings and count POIs by business category and tract
tract_pois_clean_agg <- tract_pois_clean %>%
st_drop_geometry() %>%
group_by(GEOID, main_category, hhincomeE, popE) %>%
summarise(num_pois = n_distinct(id), med_rating = median(rating), avg_rating = mean(rating)) %>%
ungroup()
## `summarise()` has grouped output by 'GEOID', 'main_category', 'hhincomeE'. You
## can override using the `.groups` argument.
Scatter plots of household income and population against rating and number of POIs
ggplot(tract_pois_clean_agg, aes(x=hhincomeE, y=avg_rating, color=main_category)) +
geom_point() +
labs(title="Household Income vs Avg Rating")+
theme_minimal()
ggplot(tract_pois_clean_agg, aes(x=hhincomeE, y=num_pois, color=main_category)) +
geom_point() +
labs(title="Household Income vs Number of POIs")+
theme_minimal()
ggplot(tract_pois_clean_agg, aes(x=popE, y=avg_rating, color=main_category)) +
geom_point() +
labs(title="Population vs Avg Rating")+
theme_minimal()
ggplot(tract_pois_clean_agg, aes(x=popE, y=num_pois, color=main_category)) +
geom_point() +
labs(title="Population vs Number of POIs")+
theme_minimal()
The four scatter plots above show no clear association between household income and either average rating or number of POIs, nor between population and either variable. This may be because the sample is small (only 19 census tracts) and because the restaurant and shopping categories may not be the relevant ones. Other POI categories might show a relationship with these variables.
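A visual reading of scatter plots can be supplemented with a rank correlation, which suits small samples and does not assume a linear relationship. The sketch below applies base R's `cor.test()` with `method = "spearman"` to a hypothetical six-row table. The numbers are made up purely to show the call and say nothing about Valdosta. With the real data, the same call could be run on `tract_pois_clean_agg` for each business category.

```r
# Hypothetical tract-level values (made-up numbers, no ties)
toy_tracts <- data.frame(
  hhincomeE = c(32000, 41000, 55000, 47000, 61000, 38000),
  num_pois  = c(12, 5, 9, 30, 4, 18)
)

# Spearman rank correlation between income and number of POIs
res <- cor.test(toy_tracts$hhincomeE, toy_tracts$num_pois,
                method = "spearman")

res$estimate  # rho, between -1 and 1
res$p.value   # p-value for the null hypothesis of no association
```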
Conclusion
This document analyzes Yelp POI data for the restaurant and shopping categories in Valdosta, GA. The interactive map of POI locations shows two major clusters located along the main roads.
The top 10 restaurant and shopping categories are shown. By number of POIs, the leading restaurant category is fast food and the leading shopping category is women’s clothing.
Rating and price are also analyzed. The rating distributions differ visibly between the low (`$`), medium (`$$`), and high (`$$$`) price tiers: the low-tier distributions are approximately uniform, while the higher tiers show observable peaks.
Lastly, the relationships between household income, population, average rating, and number of POIs are considered. No trend is apparent among them.