Mini2

Zihan Weng

2024-09-27

We first load the data from the RData file.

load("yelp_data_williamsburg.RData")
yelp_data <- yelp_combined

Inspect the data.

head(yelp_data)

## Simple feature collection with 6 features and 19 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 20
##   id        alias name  image_url is_closed url   review_count categories rating
##   <chr>     <chr> <chr> <chr>     <lgl>     <chr>        <int> <list>      <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE     http…          702 <df>          4.3
## 2 acF2X_gj… food… Food… https://… FALSE     http…         3331 <df>          4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE     http…          247 <df>          4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE     http…          521 <df>          4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE     http…          258 <df>          4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE     http…         1599 <df>          4.4
## # ℹ 11 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## #   location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## #   business_hours <list>, attributes <df[,4]>, category <chr>,
## #   geometry <POINT [°]>

num_rows <- nrow(yelp_data)
print(num_rows)

## [1] 100

str(yelp_data$coordinates[[1]])

##  num [1:100] 37.2 37.3 37.2 37.3 37.3 ...

str(yelp_data$coordinates[[2]])  # Check the structure of the first element

##  num [1:100] -76.7 -76.7 -76.7 -76.7 -76.7 ...

yelp_data <- yelp_data %>%
  mutate(
    latitude = coordinates[[1]],   # Extract the first vector (latitude)
    longitude = coordinates[[2]]   # Extract the second vector (longitude)
  )

head(yelp_data)

## Simple feature collection with 6 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 22
##   id        alias name  image_url is_closed url   review_count categories rating
##   <chr>     <chr> <chr> <chr>     <lgl>     <chr>        <int> <list>      <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE     http…          702 <df>          4.3
## 2 acF2X_gj… food… Food… https://… FALSE     http…         3331 <df>          4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE     http…          247 <df>          4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE     http…          521 <df>          4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE     http…          258 <df>          4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE     http…         1599 <df>          4.4
## # ℹ 13 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## #   location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## #   business_hours <list>, attributes <df[,4]>, category <chr>,
## #   geometry <POINT [°]>, latitude <dbl>, longitude <dbl>

Now, let’s tidy the data:

Delete duplicated rows.

yelp_data <- yelp_data %>%
  distinct()

Flatten nested columns.

if ("category" %in% names(yelp_data)) {
  yelp_data <- yelp_data %>%
    unnest(cols = category)
}

Remove rows with missing coordinates.

# Remove rows where either latitude or longitude is missing (i.e., NA values)
yelp_data <- yelp_data%>%
  filter(!is.na(latitude) & !is.na(longitude))

Filter businesses within a specific city boundary of Williamsburg.

# Filter rows based on longitude and latitude
yelp_data <- yelp_data %>%
  filter(longitude > -76.75 & longitude < -76.3, 
         latitude > 37.2 & latitude < 37.4)

# View the filtered data
head(yelp_data)

## Simple feature collection with 6 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -76.72461 ymin: 37.24495 xmax: -76.6585 ymax: 37.2944
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 22
##   id        alias name  image_url is_closed url   review_count categories rating
##   <chr>     <chr> <chr> <chr>     <lgl>     <chr>        <int> <list>      <dbl>
## 1 k_aTyW6m… maur… Maur… https://… FALSE     http…          702 <df>          4.3
## 2 acF2X_gj… food… Food… https://… FALSE     http…         3331 <df>          4.5
## 3 kgCthurC… wayp… Wayp… https://… FALSE     http…          247 <df>          4.3
## 4 zWWMR1xI… old-… Old … https://… FALSE     http…          521 <df>          4.3
## 5 bBVhv4O3… casa… Casa… https://… FALSE     http…          258 <df>          4.6
## 6 Zy848sW5… seco… Seco… https://… FALSE     http…         1599 <df>          4.4
## # ℹ 13 more variables: coordinates <df[,2]>, transactions <list>, price <chr>,
## #   location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## #   business_hours <list>, attributes <df[,4]>, category <chr>,
## #   geometry <POINT [°]>, latitude <dbl>, longitude <dbl>

Count the rows that the dataset have now.

num_rows_cleaned <- nrow(yelp_data)
print(num_rows_cleaned)

## [1] 98

I am going to perform an additional wrangling to help with the storytelling related to price.

# Group by price if the column exists and categorize into groups
if ("price" %in% names(yelp_data)) {
  yelp_data <- yelp_data %>%
    mutate(price_group = case_when(
      price == "$" ~ "Cheap",
      price == "$$" ~ "Moderate",
      price == "$$$" ~ "Expensive",
      price == "$$$$" ~ "Very Expensive",
      TRUE ~ "Unknown"
    ))
}

Now, summarize ratings and their relation to review counts.

if ("rating" %in% names(yelp_data) && "review_count" %in% names(yelp_data)) {
  rating_summary <- yelp_data %>%
    group_by(rating) %>%
    summarise(
      avg_reviews = mean(review_count, na.rm = TRUE),
      total_reviews = sum(review_count, na.rm = TRUE),
      business_count = n()
    ) %>%
    arrange(desc(business_count))
  
  print(rating_summary)
}

## Simple feature collection with 22 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -76.74666 ymin: 37.23163 xmax: -76.64509 ymax: 37.3281
## Geodetic CRS:  WGS 84
## # A tibble: 22 × 5
##    rating avg_reviews total_reviews business_count                      geometry
##     <dbl>       <dbl>         <int>          <int>              <MULTIPOINT [°]>
##  1    4.4        615.          7998             13 ((-76.67594 37.26842), (-76.…
##  2    4.1        473.          4734             10 ((-76.68775 37.27389), (-76.…
##  3    4.2        235.          2348             10 ((-76.69757 37.26888), (-76.…
##  4    4.5        583.          5251              9 ((-76.65928 37.24632), (-76.…
##  5    4          389           3112              8 ((-76.69136 37.27167), (-76.…
##  6    3.8        594.          4156              7 ((-76.69542 37.27128), (-76.…
##  7    4.3        296           1776              6 ((-76.68554 37.24923), (-76.…
##  8    3.4        308.          1538              5 ((-76.70069 37.26825), (-76.…
##  9    3.9        678           3390              5 ((-76.70695 37.26983), (-76.…
## 10    4.6        138            690              5 ((-76.67712 37.26805), (-76.…
## # ℹ 12 more rows

I am going to perform the analysis on price and rating distribution

# Price and rating distribution analysis
if ("price_group" %in% names(yelp_data) && "rating" %in% names(yelp_data)) {
  price_rating_summary <- yelp_data %>%
    group_by(price_group, rating) %>%
    summarise(
      avg_reviews = mean(review_count, na.rm = TRUE),
      total_reviews = sum(review_count, na.rm = TRUE),
      business_count = n()
    ) %>%
    arrange(price_group, desc(rating))
  
  print(price_rating_summary)
}

## `summarise()` has grouped output by 'price_group'. You can override using the
## `.groups` argument.

## Simple feature collection with 36 features and 5 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -76.74666 ymin: 37.23163 xmax: -76.64509 ymax: 37.3281
## Geodetic CRS:  WGS 84
## # A tibble: 36 × 6
## # Groups:   price_group [4]
##    price_group rating avg_reviews total_reviews business_count
##    <chr>        <dbl>       <dbl>         <int>          <int>
##  1 Cheap          4          623            623              1
##  2 Expensive      4.7         29             29              1
##  3 Expensive      4.5        448.          1790              4
##  4 Expensive      4.4        206            412              2
##  5 Expensive      4.3        247            247              1
##  6 Expensive      4.2         31             31              1
##  7 Expensive      4.1        606           1212              2
##  8 Expensive      4          328.           655              2
##  9 Expensive      3.8        394            788              2
## 10 Expensive      3.7         13             13              1
## # ℹ 26 more rows
## # ℹ 1 more variable: geometry <GEOMETRY [°]>

# Plot distribution of ratings by price group
if ("rating" %in% names(yelp_data) && "price_group" %in% names(yelp_data)) {
  ggplot(yelp_data, aes(x = rating, fill = price_group)) +
    geom_bar() +
    labs(title = "Distribution of Ratings by Price Group",
         x = "Rating",
         y = "Count of Businesses",
         fill = "Price Group") +
    theme_minimal()
}

I saw that the business categorized as ‘cheap’ only takes a very small portion, so I decided to check on the number of it.

# Count the number of "Cheap" businesses
cheap_count <- yelp_data %>%
  filter(price_group == "Cheap") %>%
  nrow()

# Display the count
print(cheap_count)

## [1] 1

Save the cleaned data just in case we need it for future analysis.

save(yelp_data, file = "tidy_yelp_data_williamsburg.RData")

With these analysis, I would be able to tell a short story about the findings.

cheap_count_untidy <- yelp_data %>%
  filter(price_group == "Cheap") %>%
  nrow()

# Print the count of "Cheap" businesses before tidying
print(paste("Cheap businesses in untidy data:", cheap_count_untidy))

## [1] "Cheap businesses in untidy data: 1"

Here’s the short story:

After tidying the Yelp dataset, I noticed that the number of records decreased from 100 to 98. This reduction can likely be attributed to the removal of incomplete or duplicate entries during the data cleaning process. Specifically, rows with missing latitude or longitude coordinates were filtered out, as well as any duplicates that may have existed in the original dataset. This slight decrease of 2 rows suggests that only a small portion of the data was affected by tidying, and the majority of the information remained intact and usable for analysis.

The analysis revealed that most businesses fall within the “Moderate” and “Expensive” price groups. A distribution plot shows that businesses, particularly in these two price categories, cluster around a 4-star rating. This suggests that businesses in Williamsburg with moderate or higher prices generally maintain strong ratings, with a large concentration of businesses rated between 3.5 and 4.5 stars. Notably, even though price groups like “Cheap” and “Unknown” are underrepresented, the businesses within these groups still receive ratings comparable to higher-priced businesses. This reinforces the idea that price does not directly determine quality or customer satisfaction, as businesses across all price groups can achieve similar ratings.

Another key finding is the relatively low presence of “Cheap” businesses in the dataset, with only one such business identified, both before and after tidying the data. This suggests that Williamsburg may cater more to middle- and higher-income consumers, as the majority of businesses fall into the “Moderate” and “Expensive” price groups. Additionally, the “Expensive” group shows a strong presence of high ratings, indicating that consumers are generally satisfied with pricier establishments, likely valuing the overall experience and quality.