We first load the data from the RData file.
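library(tidyverse) # dplyr, tidyr, ggplot2 (assumed; not loaded elsewhere in this section)
library(sf)        # needed to work with the sf geometry column
load("yelp_data_williamsburg.RData")
yelp_data <- yelp_combined # load() restores the object under its saved name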
Inspect the data.
head(yelp_data)
## Simple feature collection with 6 features and 19 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS: WGS 84
## # A tibble: 6 × 20
## id alias name image_url is_closed url review_count categories rating
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <list> <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE http… 702 <df> 4.3
## 2 acF2X_gj… food… Food… https://… FALSE http… 3331 <df> 4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE http… 247 <df> 4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE http… 521 <df> 4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE http… 258 <df> 4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE http… 1599 <df> 4.4
## # ℹ 11 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## # location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## # business_hours <list>, attributes <df[,4]>, category <chr>,
## # geometry <POINT [°]>
num_rows <- nrow(yelp_data)
print(num_rows)
## [1] 100
str(yelp_data$coordinates[[1]])
## num [1:100] 37.2 37.3 37.2 37.3 37.3 ...
str(yelp_data$coordinates[[2]]) # Check the structure of the second element (longitude)
## num [1:100] -76.7 -76.7 -76.7 -76.7 -76.7 ...
yelp_data <- yelp_data %>%
  mutate(
    latitude = coordinates[[1]],  # Extract the first vector (latitude)
    longitude = coordinates[[2]]  # Extract the second vector (longitude)
  )
head(yelp_data)
## Simple feature collection with 6 features and 21 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS: WGS 84
## # A tibble: 6 × 22
## id alias name image_url is_closed url review_count categories rating
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <list> <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE http… 702 <df> 4.3
## 2 acF2X_gj… food… Food… https://… FALSE http… 3331 <df> 4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE http… 247 <df> 4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE http… 521 <df> 4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE http… 258 <df> 4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE http… 1599 <df> 4.4
## # ℹ 13 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## # location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## # business_hours <list>, attributes <df[,4]>, category <chr>,
## # geometry <POINT [°]>, latitude <dbl>, longitude <dbl>
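Since yelp_data is an sf object, the coordinates can also be read from the geometry column rather than from the nested coordinates data frame. The sketch below shows one way to do that on a made-up two-row sf object (toy_sf and its values are hypothetical, not taken from the Yelp data). Note that st_coordinates() returns X (longitude) before Y (latitude), the reverse of the order in coordinates.

```r
library(dplyr)
library(sf)

# Hypothetical toy points standing in for the Yelp businesses
toy_sf <- tibble(
  name = c("Place A", "Place B"),
  lon  = c(-76.70, -76.68),
  lat  = c(37.27, 37.25)
) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)

# Matrix with one row per point and columns "X" (longitude) and "Y" (latitude)
xy <- st_coordinates(toy_sf)

toy_sf <- toy_sf %>%
  mutate(
    longitude = unname(xy[, "X"]),
    latitude  = unname(xy[, "Y"])
  )
print(toy_sf)
```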
Now, let’s tidy the data:
- Delete duplicated rows.
yelp_data <- yelp_data %>%
  distinct()
- Flatten nested columns.
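# Note: category is already a character column in the output above, so this
# leaves the data unchanged; list-columns such as categories stay nested.
if ("category" %in% names(yelp_data)) {
  yelp_data <- yelp_data %>%
    unnest(cols = category)
}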
- Remove rows with missing coordinates.
# Remove rows where either latitude or longitude is missing (i.e., NA values)
yelp_data <- yelp_data %>%
  filter(!is.na(latitude) & !is.na(longitude))
- Keep only businesses inside a rectangular bounding box around Williamsburg.
# Keep rows whose longitude and latitude fall inside the bounding box
yelp_data <- yelp_data %>%
  filter(longitude > -76.75 & longitude < -76.3,
         latitude > 37.2 & latitude < 37.4)
# View the filtered data
head(yelp_data)
## Simple feature collection with 6 features and 21 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS: WGS 84
## # A tibble: 6 × 22
## id alias name image_url is_closed url review_count categories rating
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <list> <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE http… 702 <df> 4.3
## 2 acF2X_gj… food… Food… https://… FALSE http… 3331 <df> 4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE http… 247 <df> 4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE http… 521 <df> 4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE http… 258 <df> 4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE http… 1599 <df> 4.4
## # ℹ 13 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## # location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## # business_hours <list>, attributes <df[,4]>, category <chr>,
## # geometry <POINT [°]>, latitude <dbl>, longitude <dbl>
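The tidying steps above each overwrite yelp_data, so it is not possible afterwards to tell which step removed rows. A minimal sketch of one way to keep track, assuming it is acceptable to hold an intermediate object per step, is to log the row count after each one. The data here (toy) is a hypothetical stand-in, not the real Yelp records, and the step names are my own labels.

```r
library(dplyr)

# Hypothetical toy data: one exact duplicate, one missing latitude,
# and one point outside the bounding box
toy <- tibble(
  id        = c("a", "a", "b", "c", "d"),
  latitude  = c(37.27, 37.27, NA, 37.25, 38.90),
  longitude = c(-76.70, -76.70, -76.68, -76.66, -77.04)
)

step1 <- toy %>% distinct()
step2 <- step1 %>% filter(!is.na(latitude) & !is.na(longitude))
step3 <- step2 %>%
  filter(longitude > -76.75 & longitude < -76.3,
         latitude > 37.2 & latitude < 37.4)

# One row per step, with the number of rows that step removed
row_log <- tibble(
  step = c("raw", "distinct", "drop missing coordinates", "bounding box"),
  rows = c(nrow(toy), nrow(step1), nrow(step2), nrow(step3))
) %>%
  mutate(removed = lag(rows) - rows)
print(row_log)
```

Applied to the real data, a log like this would show whether duplicates, missing coordinates, or the bounding box account for any dropped rows.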
Count the rows the dataset has now.
num_rows_cleaned <- nrow(yelp_data)
print(num_rows_cleaned)
## [1] 98
I am going to do some additional wrangling to support the storytelling about price.
# If the price column exists, recode Yelp's dollar-sign tiers into labeled groups
if ("price" %in% names(yelp_data)) {
  yelp_data <- yelp_data %>%
    mutate(price_group = case_when(
      price == "$" ~ "Cheap",
      price == "$$" ~ "Moderate",
      price == "$$$" ~ "Expensive",
      price == "$$$$" ~ "Very Expensive",
      TRUE ~ "Unknown" # missing or unrecognized price
    ))
}
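Because price_group is stored as plain text, tables and plot legends will sort it alphabetically (Cheap, Expensive, Moderate, ...). A minimal sketch of one option, assuming the cheapest-to-priciest order is the one wanted, is to convert it to a factor with explicit levels. The toy data frame and the price_levels vector are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical toy data, including a missing price
toy <- tibble(price = c("$$", NA, "$", "$$$", "$$$$"))

# Desired display order, cheapest first, with "Unknown" last
price_levels <- c("Cheap", "Moderate", "Expensive", "Very Expensive", "Unknown")

toy <- toy %>%
  mutate(
    price_group = case_when(
      price == "$" ~ "Cheap",
      price == "$$" ~ "Moderate",
      price == "$$$" ~ "Expensive",
      price == "$$$$" ~ "Very Expensive",
      TRUE ~ "Unknown" # NA prices land here
    ),
    price_group = factor(price_group, levels = price_levels)
  )

# Counts are reported in the level order rather than alphabetically
print(table(toy$price_group))
```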
Now, summarize ratings and their relation to review counts.
if ("rating" %in% names(yelp_data) && "review_count" %in% names(yelp_data)) {
  rating_summary <- yelp_data %>%
    group_by(rating) %>%
    summarise(
      avg_reviews = mean(review_count, na.rm = TRUE),
      total_reviews = sum(review_count, na.rm = TRUE),
      business_count = n()
    ) %>%
    arrange(desc(business_count))
  print(rating_summary)
}
## Simple feature collection with 22 features and 4 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -76.74666 ymin: 37.23163 xmax: -76.64509 ymax: 37.3281
## Geodetic CRS: WGS 84
## # A tibble: 22 × 5
## rating avg_reviews total_reviews business_count geometry
## <dbl> <dbl> <int> <int> <MULTIPOINT [°]>
## 1 4.4 615. 7998 13 ((-76.67594 37.26842), (-76.…
## 2 4.1 473. 4734 10 ((-76.68775 37.27389), (-76.…
## 3 4.2 235. 2348 10 ((-76.69757 37.26888), (-76.…
## 4 4.5 583. 5251 9 ((-76.65928 37.24632), (-76.…
## 5 4 389 3112 8 ((-76.69136 37.27167), (-76.…
## 6 3.8 594. 4156 7 ((-76.69542 37.27128), (-76.…
## 7 4.3 296 1776 6 ((-76.68554 37.24923), (-76.…
## 8 3.4 308. 1538 5 ((-76.70069 37.26825), (-76.…
## 9 3.9 678 3390 5 ((-76.70695 37.26983), (-76.…
## 10 4.6 138 690 5 ((-76.67712 37.26805), (-76.…
## # ℹ 12 more rows
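The summary above still carries a geometry column, because summarising an sf object combines the points in each group into a MULTIPOINT. If only the numbers are needed, one option is to drop the geometry first with st_drop_geometry(). The sketch below assumes that is acceptable for this step and uses a hypothetical three-row sf object (toy) rather than the real data.

```r
library(dplyr)
library(sf)

# Hypothetical toy sf object standing in for yelp_data
toy <- tibble(
  rating       = c(4.4, 4.4, 4.1),
  review_count = c(100L, 300L, 50L),
  lon          = c(-76.70, -76.68, -76.69),
  lat          = c(37.27, 37.26, 37.25)
) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)

rating_summary <- toy %>%
  st_drop_geometry() %>% # plain tibble from here on
  group_by(rating) %>%
  summarise(
    avg_reviews = mean(review_count, na.rm = TRUE),
    total_reviews = sum(review_count, na.rm = TRUE),
    business_count = n()
  ) %>%
  arrange(desc(business_count))
print(rating_summary)
```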
Next, I analyze how ratings are distributed across price groups.
# Price and rating distribution analysis
if ("price_group" %in% names(yelp_data) && "rating" %in% names(yelp_data)) {
  price_rating_summary <- yelp_data %>%
    group_by(price_group, rating) %>%
    summarise(
      avg_reviews = mean(review_count, na.rm = TRUE),
      total_reviews = sum(review_count, na.rm = TRUE),
      business_count = n()
    ) %>%
    arrange(price_group, desc(rating))
  print(price_rating_summary)
}
## `summarise()` has grouped output by 'price_group'. You can override using the
## `.groups` argument.
## Simple feature collection with 36 features and 5 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -76.74666 ymin: 37.23163 xmax: -76.64509 ymax: 37.3281
## Geodetic CRS: WGS 84
## # A tibble: 36 × 6
## # Groups: price_group [4]
## price_group rating avg_reviews total_reviews business_count
## <chr> <dbl> <dbl> <int> <int>
## 1 Cheap 4 623 623 1
## 2 Expensive 4.7 29 29 1
## 3 Expensive 4.5 448. 1790 4
## 4 Expensive 4.4 206 412 2
## 5 Expensive 4.3 247 247 1
## 6 Expensive 4.2 31 31 1
## 7 Expensive 4.1 606 1212 2
## 8 Expensive 4 328. 655 2
## 9 Expensive 3.8 394 788 2
## 10 Expensive 3.7 13 13 1
## # ℹ 26 more rows
## # ℹ 1 more variable: geometry <GEOMETRY [°]>
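The summarise() message above appears because the result is still grouped by price_group. A minimal sketch of one way to silence it, assuming an ungrouped result is what is wanted, is to pass .groups = "drop". The toy data frame is a hypothetical stand-in for yelp_data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  price_group  = c("Moderate", "Moderate", "Expensive"),
  rating       = c(4.0, 4.0, 4.5),
  review_count = c(100L, 200L, 50L)
)

price_rating_summary <- toy %>%
  group_by(price_group, rating) %>%
  summarise(
    avg_reviews = mean(review_count, na.rm = TRUE),
    total_reviews = sum(review_count, na.rm = TRUE),
    business_count = n(),
    .groups = "drop" # return an ungrouped tibble and suppress the message
  ) %>%
  arrange(price_group, desc(rating))
print(price_rating_summary)
```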
# Plot distribution of ratings by price group
if ("rating" %in% names(yelp_data) && "price_group" %in% names(yelp_data)) {
  ggplot(yelp_data, aes(x = rating, fill = price_group)) +
    geom_bar() +
    labs(title = "Distribution of Ratings by Price Group",
         x = "Rating",
         y = "Count of Businesses",
         fill = "Price Group") +
    theme_minimal()
}
I saw that businesses categorized as ‘Cheap’ make up only a very small
share, so I decided to count them.
# Count the number of "Cheap" businesses
cheap_count <- yelp_data %>%
  filter(price_group == "Cheap") %>%
  nrow()
# Display the count
print(cheap_count)
## [1] 1
Save the cleaned data just in case we need it for future analysis.
save(yelp_data, file = "tidy_yelp_data_williamsburg.RData")
With these analyses, I can tell a short story about the
findings.
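cheap_count_untidy <- yelp_data %>%
  filter(price_group == "Cheap") %>%
  nrow()
# Caution: yelp_data was overwritten during tidying, so this repeats the tidy count.
# A true pre-tidying count would filter the original yelp_combined on price == "$".
print(paste("Cheap businesses in untidy data:", cheap_count_untidy))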
## [1] "Cheap businesses in untidy data: 1"
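For a genuine before-and-after comparison, the count has to be taken once on the raw object and once on the tidied one. The sketch below illustrates the idea on hypothetical toy data (raw_toy and tidy_toy are made-up names and values); for the real data, the raw object would be yelp_combined, which has price but not price_group.

```r
library(dplyr)

# Hypothetical raw data with one duplicated "$" business
raw_toy <- tibble(
  id    = c("a", "a", "b", "c"),
  price = c("$", "$", "$$", NA)
)

# Tidying step: drop exact duplicates
tidy_toy <- raw_toy %>% distinct()

# filter() drops rows where the comparison is NA, so missing prices are excluded
cheap_before <- raw_toy %>% filter(price == "$") %>% nrow()
cheap_after  <- tidy_toy %>% filter(price == "$") %>% nrow()

print(paste("Cheap businesses before tidying:", cheap_before))
print(paste("Cheap businesses after tidying:", cheap_after))
```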
Here’s the short story:
After tidying the Yelp dataset, the number of records fell from 100 to 98. Three steps could have removed rows: dropping exact duplicates, dropping rows with missing latitude or longitude, and keeping only rows inside the Williamsburg bounding box. I did not record the row count after each step, so I cannot say which of them accounts for the 2 removed rows. Either way, only a small portion of the data was affected by tidying, and the majority of the information remained intact and usable for analysis.
The analysis revealed that most businesses fall within the “Moderate” and “Expensive” price groups. The distribution plot shows that businesses, particularly in these two price categories, cluster around a 4-star rating, with a large concentration rated between 3.5 and 4.5 stars. This suggests that businesses in Williamsburg with moderate or higher prices generally maintain strong ratings. The “Cheap” and “Unknown” groups are underrepresented, but their ratings are comparable to those of higher-priced businesses; the single “Cheap” business, for example, is rated 4.0. This is consistent with the idea that price alone does not determine quality or customer satisfaction, although one business is too few to generalize from.
Another key finding is the scarcity of “Cheap” businesses: only one was identified. Because the “untidy” count above was run on the already-tidied yelp_data, it confirms the tidy count rather than giving a separate pre-tidying figure. The scarcity suggests that Williamsburg may cater more to middle- and higher-income consumers, as the majority of businesses fall into the “Moderate” and “Expensive” price groups. Additionally, the “Expensive” group shows a strong presence of high ratings, indicating that consumers are generally satisfied with pricier establishments, possibly because they value the overall experience and quality.