Introduction

California’s restaurant inspection program provides comprehensive health and safety data for food establishments across the state. By analyzing inspection data from California’s health departments and linking it to public sentiment expressed in Yelp reviews, this project explores how regulatory compliance and customer experience intersect in the state’s diverse dining landscape.

The goal is to uncover patterns between inspection outcomes and online reviews, in particular which types of violations may correlate with lower ratings or negative feedback. The analysis combines structured government data with unstructured consumer sentiment to give a richer picture of restaurant performance and public perception, with implications for public health policy and hospitality management.

Setup Global Options

Load Required Libraries

req_packages <- c("DBI","RMySQL","dplyr","dbplyr","knitr","tidyr", "readr", "stringr","tibble", "rmarkdown", "purrr", "lubridate", "here", "httr2", "httr",  "RCurl","rvest","xml2","jsonlite","kableExtra", "tidytext", "geniusr","sentimentr","syuzhet","ggplot2", "tidyverse","DT","fuzzyjoin")
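# Install any package that is missing, then attach it
for (pkg in req_packages) {
  if (!require(pkg, character.only = TRUE)) {
    message(paste("Installing package:", pkg))
    install.packages(pkg, dependencies = TRUE)
  } else {
    message(paste(pkg, " already installed."))
  }
  # Needed after a fresh install; harmless if require() already attached the package
  library(pkg, character.only = TRUE)
}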

## Loading required package: DBI

## DBI  already installed.

## Loading required package: RMySQL

## RMySQL  already installed.

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## dplyr  already installed.

## Loading required package: dbplyr

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

## dbplyr  already installed.

## Loading required package: knitr

## knitr  already installed.

## Loading required package: tidyr

## tidyr  already installed.

## Loading required package: readr

## readr  already installed.

## Loading required package: stringr

## stringr  already installed.

## Loading required package: tibble

## tibble  already installed.

## Loading required package: rmarkdown

## rmarkdown  already installed.

## Loading required package: purrr

## purrr  already installed.

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

## lubridate  already installed.

## Loading required package: here

## here() starts at /Users/paulabrown/Documents/CUNY SPS- Data 607/Final Project

## here  already installed.

## Loading required package: httr2

## httr2  already installed.

## Loading required package: httr

## httr  already installed.

## Loading required package: RCurl

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

## RCurl  already installed.

## Loading required package: rvest

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

## rvest  already installed.

## Loading required package: xml2

## 
## Attaching package: 'xml2'

## The following object is masked from 'package:httr2':
## 
##     url_parse

## xml2  already installed.

## Loading required package: jsonlite

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

## jsonlite  already installed.

## Loading required package: kableExtra

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

## kableExtra  already installed.

## Loading required package: tidytext

## tidytext  already installed.

## Loading required package: geniusr

## geniusr  already installed.

## Loading required package: sentimentr

## sentimentr  already installed.

## Loading required package: syuzhet

## 
## Attaching package: 'syuzhet'

## The following object is masked from 'package:sentimentr':
## 
##     get_sentences

## syuzhet  already installed.

## Loading required package: ggplot2

## ggplot2  already installed.

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ RCurl::complete()        masks tidyr::complete()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ jsonlite::flatten()      masks purrr::flatten()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ rvest::guess_encoding()  masks readr::guess_encoding()
## ✖ dbplyr::ident()          masks dplyr::ident()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ dbplyr::sql()            masks dplyr::sql()
## ✖ xml2::url_parse()        masks httr2::url_parse()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## tidyverse  already installed.
## 
## Loading required package: DT
## 
## DT  already installed.
## 
## Loading required package: fuzzyjoin
## 
## fuzzyjoin  already installed.

Data Loading

CA Restaurant Data

Retrieve the CA restaurant inspection data from GitHub and load into dataframes.

# Base URL for raw GitHub files
base_url <- "https://raw.githubusercontent.com/PaulaB989/NYC_Open_Restaurants_and_Yelp_Reviews/main/"

# Read/push CSV data into dataframe
# ---- CA Restaurant CSV ----
CA_Inspections_csv <- paste0(base_url, "CA_inspections.csv")
CA_Inspections <- read_csv(CA_Inspections_csv)

# ---- CA Violations ----
CA_Violations_csv <- paste0(base_url, "CA_violations.csv")
CA_Violations <- read_csv(CA_Violations_csv)

# Display structure
cat("CA Restaurant Inspections data loaded:\n")

## CA Restaurant Inspections data loaded:

cat("Rows:", nrow(CA_Inspections), "\n")

## Rows: 191371

cat("Columns:", ncol(CA_Inspections), "\n")

## Columns: 20

cat("Column names:", paste(names(CA_Inspections), collapse = ", "), "\n\n")

## Column names: activity_date, employee_id, facility_address, facility_city, facility_id, facility_name, facility_state, facility_zip, grade, owner_id, owner_name, pe_description, program_element_pe, program_name, program_status, record_id, score, serial_number, service_code, service_description

cat("CA Restaurants Violations loaded:\n")

## CA Restaurants Violations loaded:

cat("Rows:", nrow(CA_Violations), "\n")

## Rows: 906014

cat("Columns:", ncol(CA_Violations), "\n")

## Columns: 5

cat("Column names:", paste(names(CA_Violations), collapse = ", "), "\n\n")

## Column names: points, serial_number, violation_code, violation_description, violation_status

#Glimpse the data
glimpse(CA_Inspections)

## Rows: 191,371
## Columns: 20
## $ activity_date       <date> 2017-05-09, 2017-04-10, 2017-04-04, 2017-08-15, 2…
## $ employee_id         <chr> "EE0000593", "EE0000126", "EE0000593", "EE0000971"…
## $ facility_address    <chr> "17660 CHATSWORTH ST", "3615 PACIFIC COAST HWY", "…
## $ facility_city       <chr> "GRANADA HILLS", "TORRANCE", "GRANADA HILLS", "LAN…
## $ facility_id         <chr> "FA0175397", "FA0242138", "FA0007801", "FA0013858"…
## $ facility_name       <chr> "HOVIK'S FAMOUS MEAT & DELI", "SHAKEY'S PIZZA", "B…
## $ facility_state      <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "C…
## $ facility_zip        <chr> "91344", "90505", "91344", "93536", "90701", "9000…
## $ grade               <chr> "A", "A", "A", "A", "A", "A", "B", "A", "A", "A", …
## $ owner_id            <chr> "OW0181955", "OW0237843", "OW0031150", "OW0012108"…
## $ owner_name          <chr> "JOHN'S FAMOUS MEAT & DELI INC.", "SCO, LLC", "SAB…
## $ pe_description      <chr> "FOOD MKT RETAIL (25-1,999 SF) HIGH RISK", "RESTAU…
## $ program_element_pe  <dbl> 1612, 1638, 1612, 1632, 1638, 1632, 1641, 1635, 16…
## $ program_name        <chr> "HOVIK'S FAMOUS MEAT & DELI", "SHAKEY'S PIZZA", "B…
## $ program_status      <chr> "ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"…
## $ record_id           <chr> "PR0168541", "PR0190290", "PR0036723", "PR0039905"…
## $ score               <dbl> 98, 94, 95, 98, 96, 96, 87, 96, 96, 94, 86, 94, 90…
## $ serial_number       <chr> "DAHDRUQZO", "DAL3SBUE0", "DAL2PIKJU", "DA0ZMAJXZ"…
## $ service_code        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ service_description <chr> "ROUTINE INSPECTION", "ROUTINE INSPECTION", "ROUTI…

glimpse(CA_Violations)

## Rows: 906,014
## Columns: 5
## $ points                <dbl> 1, 4, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, …
## $ serial_number         <chr> "DAJ5UNMSF", "DAT2HKIRE", "DAT2HKIRE", "DAT2HKIR…
## $ violation_code        <chr> "F044", "F007", "F033", "F035", "F033", "F007", …
## $ violation_description <chr> "# 44. Floors, walls and ceilings: properly buil…
## $ violation_status      <chr> "OUT OF COMPLIANCE", "OUT OF COMPLIANCE", "OUT O…
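Both tables carry a `serial_number` column, which suggests that it identifies a single inspection and can serve as the join key between them. The sketch below uses a few invented rows rather than the real data: it summarises violations per inspection and attaches the summary to the inspection records. Treating `serial_number` as a one-to-many key (one inspection, many violations) is an assumption based only on the column names and row counts shown above.

```r
# Toy stand-ins for CA_Inspections and CA_Violations (all values are invented)
inspections <- data.frame(
  serial_number = c("S1", "S2", "S3"),
  facility_name = c("CAFE A", "DELI B", "GRILL C"),
  score         = c(98, 87, 100),
  stringsAsFactors = FALSE
)
violations <- data.frame(
  serial_number  = c("S1", "S2", "S2", "S2"),
  violation_code = c("F044", "F007", "F033", "F035"),
  points         = c(1, 4, 1, 1),
  stringsAsFactors = FALSE
)

# One row per inspection: number of violations and total points deducted
n_viol <- aggregate(violation_code ~ serial_number, data = violations, FUN = length)
names(n_viol)[2] <- "n_violations"
total_points <- aggregate(points ~ serial_number, data = violations, FUN = sum)
names(total_points)[2] <- "total_points"
violation_summary <- merge(n_viol, total_points, by = "serial_number")

# Left join, so inspections without any recorded violation are kept
joined <- merge(inspections, violation_summary, by = "serial_number", all.x = TRUE)
joined$n_violations[is.na(joined$n_violations)] <- 0
joined$total_points[is.na(joined$total_points)] <- 0

print(joined)
```

Summarising before joining keeps the result at one row per inspection, which is the grain needed later when inspection scores are compared with Yelp ratings.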

Prepare to Pull the Split Business and Review Files from GitHub

The Yelp Open Dataset JSON files, downloaded from https://business.yelp.com/data/resources/open-dataset/, were too large to upload to GitHub through the web interface. They were therefore split into parts of an uploadable size and pushed to GitHub from the macOS Terminal. The parts now have to be listed and appended back together into their respective data frames.

repo_api <- "https://api.github.com/repos/PaulaB989/NYC_Open_Restaurants_and_Yelp_Reviews/contents"
files_info <- content(GET(repo_api))

# Extract file names
all_files <- sapply(files_info, function(x) x$name)
cat("Total files found:", length(all_files), "\n")

## Total files found: 59

cat("All files:\n")

## All files:

print(all_files)

##  [1] "CA_inspections.csv"                         
##  [2] "CA_violations.csv"                          
##  [3] "Open_Restaurants_Inspections_20251210.csv"  
##  [4] "yelp_academic_dataset_business_part_aa.json"
##  [5] "yelp_academic_dataset_business_part_ab.json"
##  [6] "yelp_academic_dataset_review_part_aa.json"  
##  [7] "yelp_academic_dataset_review_part_ab.json"  
##  [8] "yelp_academic_dataset_review_part_ac.json"  
##  [9] "yelp_academic_dataset_review_part_ad.json"  
## [10] "yelp_academic_dataset_review_part_ae.json"  
## [11] "yelp_academic_dataset_review_part_af.json"  
## [12] "yelp_academic_dataset_review_part_ag.json"  
## [13] "yelp_academic_dataset_review_part_ah.json"  
## [14] "yelp_academic_dataset_review_part_ai.json"  
## [15] "yelp_academic_dataset_review_part_aj.json"  
## [16] "yelp_academic_dataset_review_part_ak.json"  
## [17] "yelp_academic_dataset_review_part_al.json"  
## [18] "yelp_academic_dataset_review_part_am.json"  
## [19] "yelp_academic_dataset_review_part_an.json"  
## [20] "yelp_academic_dataset_review_part_ao.json"  
## [21] "yelp_academic_dataset_review_part_ap.json"  
## [22] "yelp_academic_dataset_review_part_aq.json"  
## [23] "yelp_academic_dataset_review_part_ar.json"  
## [24] "yelp_academic_dataset_review_part_as.json"  
## [25] "yelp_academic_dataset_review_part_at.json"  
## [26] "yelp_academic_dataset_review_part_au.json"  
## [27] "yelp_academic_dataset_review_part_av.json"  
## [28] "yelp_academic_dataset_review_part_aw.json"  
## [29] "yelp_academic_dataset_review_part_ax.json"  
## [30] "yelp_academic_dataset_review_part_ay.json"  
## [31] "yelp_academic_dataset_review_part_az.json"  
## [32] "yelp_academic_dataset_review_part_ba.json"  
## [33] "yelp_academic_dataset_review_part_bb.json"  
## [34] "yelp_academic_dataset_review_part_bc.json"  
## [35] "yelp_academic_dataset_review_part_bd.json"  
## [36] "yelp_academic_dataset_review_part_be.json"  
## [37] "yelp_academic_dataset_review_part_bf.json"  
## [38] "yelp_academic_dataset_review_part_bg.json"  
## [39] "yelp_academic_dataset_review_part_bh.json"  
## [40] "yelp_academic_dataset_review_part_bi.json"  
## [41] "yelp_academic_dataset_review_part_bj.json"  
## [42] "yelp_academic_dataset_review_part_bk.json"  
## [43] "yelp_academic_dataset_review_part_bl.json"  
## [44] "yelp_academic_dataset_review_part_bm.json"  
## [45] "yelp_academic_dataset_review_part_bn.json"  
## [46] "yelp_academic_dataset_review_part_bo.json"  
## [47] "yelp_academic_dataset_review_part_bp.json"  
## [48] "yelp_academic_dataset_review_part_bq.json"  
## [49] "yelp_academic_dataset_review_part_br.json"  
## [50] "yelp_academic_dataset_review_part_bs.json"  
## [51] "yelp_academic_dataset_review_part_bt.json"  
## [52] "yelp_academic_dataset_review_part_bu.json"  
## [53] "yelp_academic_dataset_review_part_bv.json"  
## [54] "yelp_academic_dataset_review_part_bw.json"  
## [55] "yelp_academic_dataset_review_part_bx.json"  
## [56] "yelp_academic_dataset_review_part_by.json"  
## [57] "yelp_academic_dataset_review_part_bz.json"  
## [58] "yelp_academic_dataset_review_part_ca.json"  
## [59] "yelp_academic_dataset_review_part_cb.json"

Yelp Business Data

All files sit in one folder, so the business parts are selected by matching a pattern against the file names.

business_files <- grep("business.*\\.json", all_files, value = TRUE)
cat("Business files found:", length(business_files), "\n")

## Business files found: 2

print(business_files)

## [1] "yelp_academic_dataset_business_part_aa.json"
## [2] "yelp_academic_dataset_business_part_ab.json"
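The pattern `business.*\\.json` matches anywhere in a file name, which is enough for the current repository contents. If other files were ever added, an anchored pattern would be safer. The sketch below compares the two; the first three file names come from the listing above, while the last one is a hypothetical stray file added only to show the difference.

```r
# First three names are from the repository listing; the last is hypothetical
all_files_demo <- c(
  "CA_inspections.csv",
  "yelp_academic_dataset_business_part_aa.json",
  "yelp_academic_dataset_business_part_ab.json",
  "old_business_backup.json.bak"
)

# Unanchored pattern: matches "business" followed later by ".json" anywhere
loose <- grep("business.*\\.json", all_files_demo, value = TRUE)

# Anchored pattern: the whole name must be a two-letter split part
strict <- grep("^yelp_academic_dataset_business_part_[a-z]{2}\\.json$",
               all_files_demo, value = TRUE)

# sort() makes the append order explicit (part_aa before part_ab)
strict <- sort(strict)

print(loose)
print(strict)
```

Here the loose pattern also picks up the backup file, while the anchored one returns only the two split parts.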

Yelp Reviews Data

The review parts are selected from the same folder in the same way, by matching a pattern against the file names.

review_files <- grep("review.*\\.json", all_files, value = TRUE)
cat("Review files found:", length(review_files), "\n")

## Review files found: 54

print(review_files)

##  [1] "yelp_academic_dataset_review_part_aa.json"
##  [2] "yelp_academic_dataset_review_part_ab.json"
##  [3] "yelp_academic_dataset_review_part_ac.json"
##  [4] "yelp_academic_dataset_review_part_ad.json"
##  [5] "yelp_academic_dataset_review_part_ae.json"
##  [6] "yelp_academic_dataset_review_part_af.json"
##  [7] "yelp_academic_dataset_review_part_ag.json"
##  [8] "yelp_academic_dataset_review_part_ah.json"
##  [9] "yelp_academic_dataset_review_part_ai.json"
## [10] "yelp_academic_dataset_review_part_aj.json"
## [11] "yelp_academic_dataset_review_part_ak.json"
## [12] "yelp_academic_dataset_review_part_al.json"
## [13] "yelp_academic_dataset_review_part_am.json"
## [14] "yelp_academic_dataset_review_part_an.json"
## [15] "yelp_academic_dataset_review_part_ao.json"
## [16] "yelp_academic_dataset_review_part_ap.json"
## [17] "yelp_academic_dataset_review_part_aq.json"
## [18] "yelp_academic_dataset_review_part_ar.json"
## [19] "yelp_academic_dataset_review_part_as.json"
## [20] "yelp_academic_dataset_review_part_at.json"
## [21] "yelp_academic_dataset_review_part_au.json"
## [22] "yelp_academic_dataset_review_part_av.json"
## [23] "yelp_academic_dataset_review_part_aw.json"
## [24] "yelp_academic_dataset_review_part_ax.json"
## [25] "yelp_academic_dataset_review_part_ay.json"
## [26] "yelp_academic_dataset_review_part_az.json"
## [27] "yelp_academic_dataset_review_part_ba.json"
## [28] "yelp_academic_dataset_review_part_bb.json"
## [29] "yelp_academic_dataset_review_part_bc.json"
## [30] "yelp_academic_dataset_review_part_bd.json"
## [31] "yelp_academic_dataset_review_part_be.json"
## [32] "yelp_academic_dataset_review_part_bf.json"
## [33] "yelp_academic_dataset_review_part_bg.json"
## [34] "yelp_academic_dataset_review_part_bh.json"
## [35] "yelp_academic_dataset_review_part_bi.json"
## [36] "yelp_academic_dataset_review_part_bj.json"
## [37] "yelp_academic_dataset_review_part_bk.json"
## [38] "yelp_academic_dataset_review_part_bl.json"
## [39] "yelp_academic_dataset_review_part_bm.json"
## [40] "yelp_academic_dataset_review_part_bn.json"
## [41] "yelp_academic_dataset_review_part_bo.json"
## [42] "yelp_academic_dataset_review_part_bp.json"
## [43] "yelp_academic_dataset_review_part_bq.json"
## [44] "yelp_academic_dataset_review_part_br.json"
## [45] "yelp_academic_dataset_review_part_bs.json"
## [46] "yelp_academic_dataset_review_part_bt.json"
## [47] "yelp_academic_dataset_review_part_bu.json"
## [48] "yelp_academic_dataset_review_part_bv.json"
## [49] "yelp_academic_dataset_review_part_bw.json"
## [50] "yelp_academic_dataset_review_part_bx.json"
## [51] "yelp_academic_dataset_review_part_by.json"
## [52] "yelp_academic_dataset_review_part_bz.json"
## [53] "yelp_academic_dataset_review_part_ca.json"
## [54] "yelp_academic_dataset_review_part_cb.json"

Safe Load Function

Wrap each load in tryCatch so that a file that cannot be parsed is reported and skipped instead of stopping the whole run.

# Function specifically designed for JSONL format (one JSON per line)
safe_load <- function(file_name) {
  file_url <- paste0(base_url, file_name)
  
  # Method 1: stream_in (best for JSONL/ndjson format like Yelp data)
  result <- tryCatch({
    cat("  Loading (stream_in):", file_name, "\n")
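    # stream_in() opens an unopened connection and closes it again on exit,
    # so the connection must not be closed a second time here
    data <- stream_in(url(file_url), verbose = FALSE)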
    
    if(is.data.frame(data) && nrow(data) > 0) {
      cat("  SUCCESS -", nrow(data), "rows loaded\n")
      return(data)
    }
    NULL
  }, error = function(e) {
    cat("   stream_in failed:", e$message, "\n")
    NULL
  })
  
  if(!is.null(result)) return(result)
  
  # Method 2: Read line by line and parse each JSON object
  result <- tryCatch({
    cat("  Trying line-by-line parsing...\n")
    lines <- readLines(file_url, warn = FALSE)
    
    # Parse each non-empty line as a separate JSON object
    data_list <- lapply(seq_along(lines), function(i) {
      if(lines[i] != "" && nchar(lines[i]) > 2) {
        tryCatch({
          fromJSON(lines[i], flatten = TRUE)
        }, error = function(e) {
          if(i <= 5) cat("    Error on line", i, ":", e$message, "\n")
          NULL
        })
      } else {
        NULL
      }
    })
    
    # Remove NULL entries and combine
    data_list <- Filter(Negate(is.null), data_list)
    
    if(length(data_list) > 0) {
      data <- bind_rows(data_list)
      if(nrow(data) > 0) {
        cat("  SUCCESS - line-by-line loaded", nrow(data), "rows\n")
        return(data)
      }
    }
    NULL
  }, error = function(e) {
    cat("   line-by-line failed:", e$message, "\n")
    NULL
  })
  
  if(!is.null(result)) return(result)
  
  cat("   FAILED - Could not load", file_name, "\n")
  return(NULL)
}
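The function above follows a "try the fast method, fall back to the tolerant one" pattern. A minimal, generic sketch of that pattern is shown below. The helper `first_success()` and the two toy loaders are hypothetical names introduced for illustration; they stand in for `stream_in()` and the line-by-line parser and are not part of the project code.

```r
# Try each loader in turn; return the first non-NULL result, or NULL if all fail.
# `loaders` is a named list of one-argument functions (hypothetical helper).
first_success <- function(x, loaders) {
  for (name in names(loaders)) {
    result <- tryCatch(
      loaders[[name]](x),
      error = function(e) {
        message("  ", name, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) {
      message("  ", name, " succeeded")
      return(result)
    }
  }
  NULL
}

# Strict loader: fails on the first bad element (like parsing a whole file at once)
strict_loader <- function(x) {
  v <- suppressWarnings(as.numeric(x))
  if (any(is.na(v))) stop("non-numeric input")
  v
}

# Lenient loader: drops bad elements and keeps the rest (like line-by-line parsing)
lenient_loader <- function(x) {
  v <- suppressWarnings(as.numeric(x))
  v[!is.na(v)]
}

out <- first_success(
  c("1", "oops", "3"),
  list(strict = strict_loader, lenient = lenient_loader)
)
print(out)
```

Keeping the fallback logic in one helper means each loader only has to signal failure with an error, and the order of attempts is visible in a single list.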

Safe Load Business Function

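A separate loader for the business files: nested fields such as attributes and hours are converted to JSON strings so that every record becomes a flat, single-row data frame with a consistent structure.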

safe_load_business <- function(file_name) {
  file_url <- paste0(base_url, file_name)
  
  cat("  Loading:", file_name, "\n")
  
  tryCatch({
    # Download file
    lines <- readLines(file_url, warn = FALSE)
    lines <- lines[nchar(lines) > 0]
    cat("    Downloaded", length(lines), "lines\n")
    
    if(length(lines) == 0) {
      cat("   No lines to parse\n")
      return(NULL)
    }
    
    # Parse line by line
    cat("    Parsing JSON...\n")
    data_list <- list()
    error_count <- 0
    
    for(i in seq_along(lines)) {
      obj <- tryCatch({
        # Parse without flatten
        parsed <- fromJSON(lines[i], flatten = FALSE)
        
        # Convert ALL nested/list columns to character strings
        for(col_name in names(parsed)) {
          if(is.list(parsed[[col_name]]) && !is.data.frame(parsed[[col_name]])) {
            # Convert to JSON string if it's a list
            parsed[[col_name]] <- as.character(toJSON(parsed[[col_name]], auto_unbox = TRUE))
          } else if(is.null(parsed[[col_name]])) {
            # Convert NULL to NA character
            parsed[[col_name]] <- NA_character_
          }
        }
        
        # Convert to single-row data frame to ensure consistent structure
        as.data.frame(parsed, stringsAsFactors = FALSE)
      }, error = function(e) {
        error_count <<- error_count + 1
        if(error_count <= 5) {
          cat("      Line", i, "error:", substr(e$message, 1, 80), "\n")
        }
        NULL
      })
      
      if(!is.null(obj)) {
        data_list[[length(data_list) + 1]] <- obj
      }
      
      # Progress indicator
      if(i %% 25000 == 0) {
        cat("      Parsed", i, "/", length(lines), "-", length(data_list), "successful\n")
      }
    }
    
    cat("    Parsing complete:", length(data_list), "successful,", error_count, "errors\n")
    
    # Combine with bind_rows
    if(length(data_list) > 0) {
      cat("    Combining rows...\n")
      data <- bind_rows(data_list)
      cat("  SUCCESS -", nrow(data), "rows,", ncol(data), "columns\n")
      return(data)
    } else {
      cat("   No valid data parsed\n")
      return(NULL)
    }
    
  }, error = function(e) {
    cat("   FAILED:", e$message, "\n")
    return(NULL)
  })
}

Test Load Single File

# Quick test of the first business file to verify loading works
if(length(business_files) > 0) {
  cat("\n=== TESTING FIRST BUSINESS FILE ===\n")
  test_file <- business_files[1]
  cat("Testing file:", test_file, "\n\n")
  
  test_result <- safe_load_business(test_file)
  
  if(!is.null(test_result) && nrow(test_result) > 0) {
    cat("\nTest successful! File format is readable.\n")
    cat("Sample data from first file:\n")
    cat("- Rows:", nrow(test_result), "\n")
    cat("- Columns:", ncol(test_result), "\n")
    cat("- Column names:", paste(head(names(test_result), 10), collapse = ", "), "\n")
    
    # Show first few CA businesses if any
    if("state" %in% names(test_result)) {
      ca_count <- sum(test_result$state == "CA", na.rm = TRUE)
      cat("- CA businesses in this file:", ca_count, "\n")
    }
  } else {
    cat("\n Test failed - check the error messages above\n")
  }
}

## 
## === TESTING FIRST BUSINESS FILE ===
## Testing file: yelp_academic_dataset_business_part_aa.json 
## 
##   Loading: yelp_academic_dataset_business_part_aa.json 
##     Downloaded 126031 lines
##     Parsing JSON...
##       Parsed 25000 / 126031 - 25000 successful
##       Parsed 50000 / 126031 - 50000 successful
##       Parsed 75000 / 126031 - 75000 successful
##       Parsed 100000 / 126031 - 100000 successful
##       Parsed 125000 / 126031 - 125000 successful
##       Line 126031 error: parse error: premature EOF
##                                        {"business_id" 
##     Parsing complete: 126030 successful, 1 errors
##     Combining rows...
##   SUCCESS - 126030 rows, 14 columns
## 
## Test successful! File format is readable.
## Sample data from first file:
## - Rows: 126030 
## - Columns: 14 
## - Column names: business_id, name, address, city, state, postal_code, latitude, longitude, stars, review_count 
## - CA businesses in this file: 4343

Load Business Data

cat("\n=== LOADING BUSINESS FILES ===\n")

## 
## === LOADING BUSINESS FILES ===

cat("Total files to load:", length(business_files), "\n\n")

## Total files to load: 2

# Load business files using the simpler function
business_list <- list()
for(i in seq_along(business_files)) {
  cat("Loading business file", i, "of", length(business_files), "\n")
  business_list[[i]] <- safe_load_business(business_files[i])
}

## Loading business file 1 of 2 
##   Loading: yelp_academic_dataset_business_part_aa.json 
##     Downloaded 126031 lines
##     Parsing JSON...
##       Parsed 25000 / 126031 - 25000 successful
##       Parsed 50000 / 126031 - 50000 successful
##       Parsed 75000 / 126031 - 75000 successful
##       Parsed 100000 / 126031 - 100000 successful
##       Parsed 125000 / 126031 - 125000 successful
##       Line 126031 error: parse error: premature EOF
##                                        {"business_id" 
##     Parsing complete: 126030 successful, 1 errors
##     Combining rows...
##   SUCCESS - 126030 rows, 14 columns
## Loading business file 2 of 2 
##   Loading: yelp_academic_dataset_business_part_ab.json 
##     Downloaded 24316 lines
##     Parsing JSON...
##       Line 1 error: parse error: trailing garbage
##                             "Wednesday":"10:0-18:0 
##     Parsing complete: 24315 successful, 1 errors
##     Combining rows...
##   SUCCESS - 24315 rows, 14 columns

# Remove NULL entries
business_list <- Filter(Negate(is.null), business_list)
cat("\nSuccessfully loaded", length(business_list), "out of", length(business_files), "business files\n")

## 
## Successfully loaded 2 out of 2 business files
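The log above shows one error at the last line of part_aa (premature EOF) and one at the first line of part_ab (trailing garbage). That is consistent with the original file having been cut at a fixed size rather than at line boundaries, so one record ends up split across the two parts and is lost when each part is parsed on its own. The sketch below reproduces the effect on an invented three-record example and shows that joining the parts before splitting into lines restores the record. It works on plain strings only and assumes the parts are consecutive pieces of a single newline-delimited file.

```r
# Toy example: a three-record JSONL text cut into two parts at an arbitrary
# character position, the way a size-based split would cut it (contents invented)
original <- paste0(
  '{"business_id":"a1","stars":4.5}\n',
  '{"business_id":"b2","stars":3.0}\n',
  '{"business_id":"c3","stars":5.0}\n'
)
cut_at <- 50  # falls in the middle of the second record
parts <- c(
  substr(original, 1, cut_at),
  substr(original, cut_at + 1, nchar(original))
)

# Rough completeness check: a whole record starts with "{" and ends with "}"
is_complete <- function(x) startsWith(x, "{") & endsWith(x, "}")

# Splitting each part into lines separately leaves two fragments at the boundary
per_part_lines <- unlist(strsplit(parts, "\n", fixed = TRUE))
n_broken <- sum(!is_complete(per_part_lines))

# Joining the parts first, then splitting into lines, restores the record
rejoined <- paste(parts, collapse = "")
lines <- strsplit(rejoined, "\n", fixed = TRUE)[[1]]

cat("Broken fragments when parsed per part:", n_broken, "\n")
cat("Complete records after rejoining:", sum(is_complete(lines)), "\n")
```

With the real parts, the same idea would require reading each part's text with its line endings intact, because `readLines()` discards the information about whether a part ended on a newline.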

if(length(business_list) > 0) {
  cat("\nCombining business data...\n")
  business <- bind_rows(business_list)
  cat("\nCombined Business data:\n")
  cat("Rows:", nrow(business), "\n")
  cat("Columns:", ncol(business), "\n")
  if(ncol(business) > 0) {
    cat("Column names:", paste(head(names(business), 15), collapse = ", "), 
        if(ncol(business) > 15) "..." else "", "\n")
  }
  glimpse(business)
} else {
  cat("\n WARNING: No business data loaded!\n")
  business <- data.frame()
}

## 
## Combining business data...
## 
## Combined Business data:
## Rows: 150345 
## Columns: 14 
## Column names: business_id, name, address, city, state, postal_code, latitude, longitude, stars, review_count, is_open, attributes, categories, hours  
## Rows: 150,345
## Columns: 14
## $ business_id  <chr> "Pns2l4eNsfO8kk83dixA6A", "mpf3x-BjTdTEA3yCZrAYPw", "tUFr…
## $ name         <chr> "Abby Rappoport, LAC, CMQ", "The UPS Store", "Target", "S…
## $ address      <chr> "1616 Chapala St, Ste 2", "87 Grasso Plaza Shopping Cente…
## $ city         <chr> "Santa Barbara", "Affton", "Tucson", "Philadelphia", "Gre…
## $ state        <chr> "CA", "MO", "AZ", "PA", "PA", "TN", "MO", "FL", "MO", "TN…
## $ postal_code  <chr> "93101", "63123", "85711", "19107", "18054", "37015", "63…
## $ latitude     <dbl> 34.42668, 38.55113, 32.22324, 39.95551, 40.33818, 36.2695…
## $ longitude    <dbl> -119.71120, -90.33570, -110.88045, -75.15556, -75.47166, …
## $ stars        <dbl> 5.0, 3.0, 3.5, 4.0, 4.5, 2.0, 2.5, 3.5, 3.0, 1.5, 3.5, 4.…
## $ review_count <int> 7, 15, 22, 80, 13, 6, 13, 5, 19, 10, 6, 10, 28, 10, 100, …
## $ is_open      <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ attributes   <chr> "{\"ByAppointmentOnly\":\"True\"}", "{\"BusinessAcceptsCr…
## $ categories   <chr> "Doctors, Traditional Chinese Medicine, Naturopathic/Holi…
## $ hours        <chr> NA, "{\"Monday\":\"0:0-0:0\",\"Tuesday\":\"8:0-18:30\",\"…
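The combined business table covers many states and business types, while the inspection data covers California food facilities only. A natural next step is to narrow the Yelp businesses to California restaurants and build a normalised name for matching. The sketch below does this on invented rows shaped like the glimpse above. It assumes that `categories` is a comma-separated string and that restaurants carry the label "Restaurants"; the helper `has_category()` and the column `name_key` are hypothetical names used for illustration.

```r
# Toy rows shaped like the `business` data frame (all values are invented)
business_demo <- data.frame(
  business_id = c("id1", "id2", "id3", "id4"),
  name        = c("Taco Stand", "The UPS Store", " Noodle House", "Corner Diner"),
  state       = c("CA", "MO", "CA", "PA"),
  categories  = c("Mexican, Restaurants", "Shipping Centers, Local Services",
                  "Restaurants, Noodles", NA),
  stringsAsFactors = FALSE
)

# Split the comma-separated category string and test for an exact label,
# so a label that merely contains the word would not count as a match.
# A missing category string yields FALSE.
has_category <- function(categories, label) {
  vapply(strsplit(categories, ",\\s*"), function(x) label %in% x, logical(1))
}

keep <- business_demo$state == "CA" &
  has_category(business_demo$categories, "Restaurants")
ca_restaurants <- business_demo[keep, ]

# Upper-case, trimmed names are easier to compare with facility_name,
# which appears in upper case in the inspection data
ca_restaurants$name_key <- toupper(trimws(ca_restaurants$name))

print(ca_restaurants[, c("business_id", "name_key")])
```

Exact matching on upper-cased names is only a starting point; restaurant names often differ slightly between sources, which is where the fuzzy matching provided by the `fuzzyjoin` package loaded earlier could come in.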

Load Reviews

Load review JSON files into the reviews data frame

# Load Review JSONs
cat("\n=== LOADING REVIEW FILES ===\n")

## 
## === LOADING REVIEW FILES ===

cat("Total files to load:", length(review_files), "\n")

## Total files to load: 54

cat("  This may take several minutes with multiple files...\n\n")

##   This may take several minutes with multiple files...

# Load reviews in batches with progress
review_list <- list()
for(i in seq_along(review_files)) {
  if(i %% 10 == 0) cat("Progress:", i, "/", length(review_files), "files\n")
  review_list[[i]] <- safe_load(review_files[i])
}

##   Loading (stream_in): yelp_academic_dataset_review_part_aa.json

##    stream_in failed: parse error: premature EOF
##                                        {"review_id":"9puAqBsJ7Z2xe45ZQ
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##   SUCCESS - line-by-line loaded 133521 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ab.json 
##    stream_in failed: lexical error: invalid string in json text.
##                                        tars":5.0,"useful":4,"funny":2,
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid string in json text.
##                                        tars":5.0,"useful":4,"funny":2,
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133189 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ac.json 
##    stream_in failed: lexical error: malformed number, a digit is required after the minus sign.
##                                        -ptC259Kpu8lIWxTw","stars":5.0,
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: malformed number, a digit is required after the minus sign.
##                                        -ptC259Kpu8lIWxTw","stars":5.0,
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131192 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ad.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ch fun. Our wonderful bartender
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ch fun. Our wonderful bartender
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129949 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ae.json 
##    stream_in failed: parse error: unallowed token at this point in JSON text
##                                        :2.0,"useful":9,"funny":5,"cool
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: unallowed token at this point in JSON text
##                                        :2.0,"useful":9,"funny":5,"cool
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 125298 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_af.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        htful. Chef brought out our foo
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        htful. Chef brought out our foo
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129277 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ag.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        id":"Ldch7Nc5gaZrhcIRd7mcjw","s
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        id":"Ldch7Nc5gaZrhcIRd7mcjw","s
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134143 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ah.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        hem better.  The ribs were OK a
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        hem better.  The ribs were OK a
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 132166 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ai.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        d from this place, except that 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        d from this place, except that 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131474 rows
## Progress: 10 / 54 files
##   Loading (stream_in): yelp_academic_dataset_review_part_aj.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        corporate & was told they were 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        corporate & was told they were 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 130843 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ak.json 
##    stream_in failed: parse error: unallowed token at this point in JSON text
##                                        :10,"funny":0,"cool":0,"text":"
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: unallowed token at this point in JSON text
##                                        :10,"funny":0,"cool":0,"text":"
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 121840 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_al.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        oke and crab dip, tiny portion 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        oke and crab dip, tiny portion 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134815 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_am.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        l giants.  Great location and t
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        l giants.  Great location and t
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134080 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_an.json 
##    stream_in failed: lexical error: invalid string in json text.
##                                        name come to him, but then he s
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid string in json text.
##                                        name come to him, but then he s
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131592 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ao.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        en. Sincerely Friendly,attentiv
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        en. Sincerely Friendly,attentiv
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 130863 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ap.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        my daughter welcomed me with a 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        my daughter welcomed me with a 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 127308 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_aq.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        orite as they were all amazing 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        orite as they were all amazing 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 127003 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ar.json 
##    stream_in failed: lexical error: malformed number, a digit is required after the decimal point.
##                                       6.\"  But, I would give the staf
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: malformed number, a digit is required after the decimal point.
##                                       6.\"  But, I would give the staf
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133438 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_as.json 
##    stream_in failed: parse error: trailing garbage
##                                     ":"I would give it a zero but I'll
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                     ":"I would give it a zero but I'll
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131959 rows
## Progress: 20 / 54 files
##   Loading (stream_in): yelp_academic_dataset_review_part_at.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ato purée, oh my God, she said
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ato purée, oh my God, she said
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129855 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_au.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        Ribs, Pork and Beef. All the sa
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        Ribs, Pork and Beef. All the sa
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 125754 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_av.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        .  Might be better.\n\nNo views
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        .  Might be better.\n\nNo views
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 127802 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_aw.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        while I found the potatoes to b
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        while I found the potatoes to b
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133097 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ax.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        e burger was nothing to brag ab
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        e burger was nothing to brag ab
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131922 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ay.json 
##    stream_in failed: lexical error: invalid string in json text.
##                                        n1E9-A","user_id":"JS7VSWD7Xc4C
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid string in json text.
##                                        n1E9-A","user_id":"JS7VSWD7Xc4C
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131217 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_az.json 
##    stream_in failed: lexical error: invalid string in json text.
##                                        for brunch on Sunday with our d
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid string in json text.
##                                        for brunch on Sunday with our d
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 127754 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ba.json 
##    stream_in failed: parse error: trailing garbage
##                                 ":5.0,"useful":0,"funny":0,"cool":0,"t
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                 ":5.0,"useful":0,"funny":0,"cool":0,"t
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 123597 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bb.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        hey gave more slices of the bre
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        hey gave more slices of the bre
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134853 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bc.json 
##    stream_in failed: parse error: trailing garbage
##                                  "text":"Mixed emotions.\n\n1. No oran
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                  "text":"Mixed emotions.\n\n1. No oran
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133486 rows
## Progress: 30 / 54 files
##   Loading (stream_in): yelp_academic_dataset_review_part_bd.json 
##    stream_in failed: parse error: trailing garbage
##                                       08-14 20:22:07"}
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                       08-14 20:22:07"}
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131506 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_be.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        st 20 minutes before she came b
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        st 20 minutes before she came b
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 130778 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bf.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        s marinated in something that m
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        s marinated in something that m
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 122275 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bg.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        gnose and fix the problem with 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        gnose and fix the problem with 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133073 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bh.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        at the base of the garage and h
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        at the base of the garage and h
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133374 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bi.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ude and didn't answer back. Thi
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ude and didn't answer back. Thi
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131022 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bj.json 
##    stream_in failed: parse error: trailing garbage
##                                      9-06-08 21:55:09"}
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                      9-06-08 21:55:09"}
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 130680 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bk.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        dition is like a sweet clotted 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        dition is like a sweet clotted 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 128060 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bl.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        brought mines out. So now im an
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        brought mines out. So now im an
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 124642 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bm.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        read all the good reviews and I
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        read all the good reviews and I
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 135082 rows
## Progress: 40 / 54 files
##   Loading (stream_in): yelp_academic_dataset_review_part_bn.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ait FOREVER for our food.  The 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ait FOREVER for our food.  The 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133013 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bo.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        i School. If you'd like your ch
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        i School. If you'd like your ch
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133052 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bp.json 
##    stream_in failed: lexical error: invalid string in json text.
##                                        ng men packed our u-haul with a
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid string in json text.
##                                        ng men packed our u-haul with a
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129888 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bq.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        lo\". \n\nMy surgeon friend and
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        lo\". \n\nMy surgeon friend and
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 122928 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_br.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ey took a bit more off the aski
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ey took a bit more off the aski
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134583 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bs.json 
##    stream_in failed: parse error: trailing garbage
##                                 ":5.0,"useful":1,"funny":0,"cool":0,"t
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : parse error: trailing garbage
##                                 ":5.0,"useful":1,"funny":0,"cool":0,"t
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 133785 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bt.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        We came here at 8:05 so there w
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        We came here at 8:05 so there w
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 132266 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bu.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        isit Charley's pretty regularly
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        isit Charley's pretty regularly
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129827 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bv.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ble :\n\n- the place was smoky 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ble :\n\n- the place was smoky 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 126124 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bw.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        !! The tenderloin screens like 
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        !! The tenderloin screens like 
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 129762 rows
## Progress: 50 / 54 files
##   Loading (stream_in): yelp_academic_dataset_review_part_bx.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ce, been twice and all is wonde
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ce, been twice and all is wonde
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 134676 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_by.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        . \n\nTheir lines are always lo
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        . \n\nTheir lines are always lo
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 132296 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_bz.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        ompany. She was extremely rude.
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        ompany. She was extremely rude.
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 131805 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_ca.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        eared to be providing great ser
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        eared to be providing great ser
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 126594 rows
##   Loading (stream_in): yelp_academic_dataset_review_part_cb.json 
##    stream_in failed: lexical error: invalid char in json text.
##                                        pay the tole and go to jersey",
##                      (right here) ------^
##  
##   Trying line-by-line parsing...
##     Error on line 1 : lexical error: invalid char in json text.
##                                        pay the tole and go to jersey",
##                      (right here) ------^
##  
##   SUCCESS - line-by-line loaded 75849 rows

# Remove NULL entries
review_list <- Filter(Negate(is.null), review_list)
cat("\nSuccessfully loaded", length(review_list), "out of", length(review_files), "review files\n")

## 
## Successfully loaded 54 out of 54 review files
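
The log above shows the same pattern for every part file: stream_in fails, and the line-by-line fallback reports an error on line 1 before loading the remaining rows. A minimal sketch of such a fallback follows, assuming the part files are newline-delimited JSON and that a part may begin with the tail of a record (for example, if the original file was split by byte count rather than by line). The helper name read_ndjson_tolerant and the temporary file are hypothetical; this is not the loader used in this report.

```r
library(jsonlite)

# Hypothetical helper: parse each line on its own and skip lines that are not valid JSON
read_ndjson_tolerant <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  parsed <- lapply(lines, function(ln) {
    tryCatch(fromJSON(ln), error = function(e) NULL)
  })
  ok <- !vapply(parsed, is.null, logical(1))
  # Each valid line becomes a one-row data frame (assumes flat records without null fields)
  rows <- lapply(parsed[ok], function(x) as.data.frame(x, stringsAsFactors = FALSE))
  list(data = do.call(rbind, rows), skipped = sum(!ok))
}

# Toy file: the first line is the tail of a record cut off by the split
tmp <- tempfile(fileext = ".json")
writeLines(c(
  'ribs were OK","date":"2018-07-07 22:09:11"}',
  '{"review_id":"r1","stars":3,"text":"Fine."}',
  '{"review_id":"r2","stars":5,"text":"Great."}'
), tmp)

res <- read_ndjson_tolerant(tmp)
nrow(res$data)   # valid records kept
res$skipped      # lines that failed to parse
```

Under that assumption, a record cut at a file boundary is dropped from both parts, so concatenating the parts before parsing would likely recover it.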

if(length(review_list) > 0) {
  cat("\nCombining review data (this may take a moment)...\n")
  reviews <- bind_rows(review_list)
  cat("\nCombined Reviews data:\n")
  cat("Rows:", nrow(reviews), "\n")
  cat("Columns:", ncol(reviews), "\n")
  if(ncol(reviews) > 0) {
    cat("Column names:", paste(names(reviews), collapse = ", "), "\n")
  }
  glimpse(reviews)
} else {
  cat("\n WARNING: No review data loaded!\n")
  reviews <- data.frame()
}

## 
## Combining review data (this may take a moment)...
## 
## Combined Reviews data:
## Rows: 6990227 
## Columns: 9 
## Column names: review_id, user_id, business_id, stars, useful, funny, cool, text, date 
## Rows: 6,990,227
## Columns: 9
## $ review_id   <chr> "KU_O5udG6zpxOg-VcAEodg", "BiTunyQ73aT9WBnpR9DZGw", "saUsX…
## $ user_id     <chr> "mh_-eMZ6K5RLWhZyISBhwA", "OyoGAe7OKpv6SyGZT5g77Q", "8g_iM…
## $ business_id <chr> "XQfwVwDr-v0ZS3_CbbE5Xw", "7ATYjTIgM3jUlt4UM3IypQ", "YjUWP…
## $ stars       <dbl> 3, 5, 3, 5, 4, 1, 5, 5, 3, 3, 5, 4, 4, 4, 4, 5, 5, 4, 5, 5…
## $ useful      <int> 0, 1, 0, 1, 1, 1, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1, 2, 0…
## $ funny       <int> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0…
## $ cool        <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ text        <chr> "If you decide to eat here, just be aware it is going to t…
## $ date        <chr> "2018-07-07 22:09:11", "2012-01-03 15:28:18", "2014-02-05 …
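
The glimpse above shows date stored as character. If reviews are later compared against inspection dates, the column would need to be parsed first. A small sketch using lubridate, with the time zone set to UTC purely as an assumption (the output above does not state one):

```r
library(lubridate)

# Sample values copied from the glimpse() output above
date_chr <- c("2018-07-07 22:09:11", "2012-01-03 15:28:18")

# The time zone is an assumption; adjust if the data source documents a different one
date_parsed <- ymd_hms(date_chr, tz = "UTC")

year(date_parsed)
as_date(date_parsed)   # date-only values, comparable with inspection dates
```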

Diagnostic Summary

cat("\n=== DIAGNOSTIC SUMMARY ===\n")

## 
## === DIAGNOSTIC SUMMARY ===

cat("Files found in repo:", length(all_files), "\n")

## Files found in repo: 59

cat("Business files identified:", length(business_files), "\n")

## Business files identified: 2

cat("Review files identified:", length(review_files), "\n")

## Review files identified: 54

cat("Business dataframes loaded:", length(business_list), "\n")

## Business dataframes loaded: 2

cat("Review dataframes loaded:", length(review_list), "\n")

## Review dataframes loaded: 54

cat("Total business rows:", if(exists("business")) nrow(business) else 0, "\n")

## Total business rows: 150345

cat("Total review rows:", if(exists("reviews")) nrow(reviews) else 0, "\n")

## Total review rows: 6990227

cat("\n=== DATA STATUS ===\n")

## 
## === DATA STATUS ===

has_business_data <- exists("business") && nrow(business) > 0
has_review_data <- exists("reviews") && nrow(reviews) > 0
cat("Business data available:", has_business_data, "\n")

## Business data available: TRUE

cat("Review data available:", has_review_data, "\n")

## Review data available: TRUE

if(!has_business_data) {
  cat("\n  WARNING: Business data not loaded properly\n")
}

Data Preparation

Check Data Availability

# Check if we have data to work with
has_business_data <- exists("business") && nrow(business) > 0
has_review_data <- exists("reviews") && nrow(reviews) > 0

cat("\n=== DATA AVAILABILITY CHECK ===\n")

## 
## === DATA AVAILABILITY CHECK ===

cat("Business data available:", has_business_data, "\n")

## Business data available: TRUE

cat("Review data available:", has_review_data, "\n")

## Review data available: TRUE

if(!has_business_data) {
  cat("\n  STOPPING: Cannot proceed without business data\n")
  cat("Please check the diagnostic output above to troubleshoot the JSON loading issue.\n")
  knitr::knit_exit()
}

Filter Yelp Data to California

# Only run if we have business data
if(has_business_data) {
  # Filter businesses to California area
  if("state" %in% names(business)) {
    business_CA <- business %>%
      filter(state == "CA")
    cat("CA Yelp businesses (filtered by state):", nrow(business_CA), "\n")
  } else if("city" %in% names(business)) {
    # Fallback to city-based filtering if state not available
    business_CA <- business %>%
      filter(grepl("Los Angeles|San Francisco|San Diego|Sacramento|San Jose", 
                   city, ignore.case = TRUE))
    cat("CA Yelp businesses (filtered by city):", nrow(business_CA), "\n")
  } else {
    business_CA <- business
    cat("CA Yelp businesses (no filtering applied):", nrow(business_CA), "\n")
  }
  
  cat("\nAvailable columns in business data:\n")
  print(names(business_CA))
  
  # Show state distribution
  if("state" %in% names(business_CA)) {
    cat("\nState distribution:\n")
    print(table(business_CA$state))
  }
}

## CA Yelp businesses (filtered by state): 5203 
## 
## Available columns in business data:
##  [1] "business_id"  "name"         "address"      "city"         "state"       
##  [6] "postal_code"  "latitude"     "longitude"    "stars"        "review_count"
## [11] "is_open"      "attributes"   "categories"   "hours"       
## 
## State distribution:
## 
##   CA 
## 5203
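
Before matching names across the two sources, it may be worth checking that they cover the same places, since name matches between businesses in different cities are likely to be false positives. A minimal sketch with made-up rows follows; the city values and the toy tibbles are illustrative only and are not taken from either dataset.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the Yelp business table and the inspection table
yelp_toy <- tibble(city = c("Santa Barbara", "Goleta", "Los Angeles "))
insp_toy <- tibble(facility_city = c("LOS ANGELES", "PASADENA"))

# Normalize case and whitespace so the city keys are comparable
yelp_cities <- yelp_toy %>%
  mutate(city_key = str_to_upper(str_trim(city))) %>%
  count(city_key, name = "n_yelp")

insp_cities <- insp_toy %>%
  mutate(city_key = str_to_upper(str_trim(facility_city))) %>%
  count(city_key, name = "n_insp")

# Cities present in both sources; an empty result would suggest the
# datasets do not overlap geographically
city_overlap <- inner_join(yelp_cities, insp_cities, by = "city_key")
city_overlap
```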

Prepare Restaurant Names for Matching

if(has_business_data && nrow(business_CA) > 0) {
  # Function to clean restaurant names for matching
  clean_name <- function(name) {
    name %>%
      str_to_lower() %>%
      str_replace_all("[^a-z0-9\\s]", "") %>%  # Remove special characters
      str_replace_all("\\s+", " ") %>%          # Normalize spaces
      str_trim()
  }
  
  # Clean names in CA restaurant dataset
  if("facility_name" %in% names(CA_Inspections)) {
    CA_Inspections <- CA_Inspections %>%
      mutate(clean_name = clean_name(facility_name))
    cat("Cleaned", nrow(CA_Inspections), "CA restaurant names\n")
  } else {
    cat("  'facility_name' column not found\n")
    cat("Available columns:", paste(names(CA_Inspections), collapse = ", "), "\n")
  }
  
  # Clean names in Yelp business dataset
  if("name" %in% names(business_CA)) {
    business_CA <- business_CA %>%
      mutate(clean_name = clean_name(name))
    cat("Cleaned", nrow(business_CA), "Yelp business names\n")
  } else {
    cat("  'name' column not found in business data\n")
    cat("Available columns:", paste(names(business_CA), collapse = ", "), "\n")
  }
}

## Cleaned 191371 CA restaurant names
## Cleaned 5203 Yelp business names
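
To make the effect of the cleaning steps concrete, the sketch below repeats the clean_name() definition from above and applies it to a few sample names. The first and third names appear in the inspection output later in this report; the second is made up for illustration.

```r
library(stringr)
library(magrittr)

# Same steps as the clean_name() function defined above
clean_name <- function(name) {
  name %>%
    str_to_lower() %>%
    str_replace_all("[^a-z0-9\\s]", "") %>%  # Remove special characters
    str_replace_all("\\s+", " ") %>%          # Normalize spaces
    str_trim()
}

samples <- c("JACK IN THE BOX #280", "Joe's  Bar & Grill", "  WHOLE FOODS MARKET #39 ")
cleaned <- clean_name(samples)
cleaned
# "jack in the box 280" "joes bar grill" "whole foods market 39"
```

Two consequences follow from the pattern: store numbers survive cleaning (so "jack in the box 280" will not exactly equal a listing named "jack in the box"), and any character outside a-z, 0-9, and whitespace is dropped, including accented letters.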

Join CA Inspections with Violations

# Join inspections with violations on serial_number
ca_data_combined <- CA_Inspections %>%
  left_join(CA_Violations, by = "serial_number", suffix = c("_insp", "_viol"))

cat("Combined CA data:\n")

## Combined CA data:

cat("Rows:", nrow(ca_data_combined), "\n")

## Rows: 901846

cat("Columns:", ncol(ca_data_combined), "\n")

## Columns: 25

# Check available columns
cat("\nAvailable columns in combined data:\n")

## 
## Available columns in combined data:

print(names(ca_data_combined))

##  [1] "activity_date"         "employee_id"           "facility_address"     
##  [4] "facility_city"         "facility_id"           "facility_name"        
##  [7] "facility_state"        "facility_zip"          "grade"                
## [10] "owner_id"              "owner_name"            "pe_description"       
## [13] "program_element_pe"    "program_name"          "program_status"       
## [16] "record_id"             "score"                 "serial_number"        
## [19] "service_code"          "service_description"   "clean_name"           
## [22] "points"                "violation_code"        "violation_description"
## [25] "violation_status"

# Check for violation-related columns
violation_cols <- names(ca_data_combined)[grepl("violation|points|critical|major|minor", 
                                                 names(ca_data_combined), 
                                                 ignore.case = TRUE)]
cat("\nViolation-related columns found:\n")

## 
## Violation-related columns found:

print(violation_cols)

## [1] "points"                "violation_code"        "violation_description"
## [4] "violation_status"

# Summary of violations per inspection (with flexible column checking)
violation_summary <- ca_data_combined %>%
  group_by(serial_number) %>%
  summarise(
    facility_name = first(facility_name),
    facility_address = first(facility_address),
    inspection_date = first(activity_date),
    score = first(score),
    grade = first(grade),
    # n() counts joined rows, so an inspection with no matching violation still counts as 1
    num_violations = n(),
    # Flexible critical violation detection based on available columns.
    # "violation_status" is present in the combined data, so only the first branch runs here.
    has_critical = if("violation_status" %in% names(ca_data_combined)) {
      any(violation_status == "Critical", na.rm = TRUE)
    } else if("points_deducted" %in% names(ca_data_combined)) {
      any(points_deducted > 0, na.rm = TRUE)
    } else if("major_violation" %in% names(ca_data_combined)) {
      any(major_violation == 1 | major_violation == TRUE, na.rm = TRUE)
    } else {
      FALSE  # Default if no critical indicator found
    },
    # The combined data has "points", not "points_deducted" (see the column list above),
    # so this falls through to the default of 0
    total_points_deducted = if("points_deducted" %in% names(ca_data_combined)) {
      sum(points_deducted, na.rm = TRUE)
    } else {
      0
    },
    .groups = "drop"
  )

cat("\nViolation summary created:\n")

## 
## Violation summary created:

cat("Unique facilities:", n_distinct(violation_summary$facility_name), "\n")

## Unique facilities: 36627

glimpse(violation_summary)

## Rows: 191,371
## Columns: 9
## $ serial_number         <chr> "DA0007NP7", "DA000ADVG", "DA000AXL0", "DA000BTS…
## $ facility_name         <chr> "WHOLE FOODS MARKET #39", "JACK IN THE BOX #280"…
## $ facility_address      <chr> "18700 VENTURA BLVD", "10967 SANTA MONICA BLVD",…
## $ inspection_date       <date> 2017-08-16, 2017-10-27, 2017-08-02, 2015-10-29,…
## $ score                 <dbl> 98, 98, 95, 96, 92, 98, 90, 94, 87, 90, 94, 93, …
## $ grade                 <chr> "A", "A", "A", "A", "A", "A", "A", "A", "B", "A"…
## $ num_violations        <int> 2, 2, 4, 3, 5, 2, 7, 5, 9, 8, 6, 5, 3, 3, 3, 4, …
## $ has_critical          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ total_points_deducted <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
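
Since has_critical is FALSE and total_points_deducted is 0 in every row shown above, it may be worth tabulating the raw columns before relying on the flags. The following is a minimal sketch on a hypothetical toy table (toy_violations and all of its values are invented for illustration). It assumes that the `points` column from the listing above holds the points deducted per violation, and that a deduction of 4 or more points marks a serious violation; neither assumption is confirmed by the data shown here.

```r
library(dplyr)

# Hypothetical stand-in for ca_data_combined: one row per violation
toy_violations <- tibble(
  serial_number    = c("S1", "S1", "S2", "S3"),
  violation_status = c("OUT OF COMPLIANCE", "OUT OF COMPLIANCE",
                       "OUT OF COMPLIANCE", "OUT OF COMPLIANCE"),
  points           = c(1, 4, 2, 0)
)

# Step 1: see which status values actually occur before matching on one
status_counts <- count(toy_violations, violation_status)

# Step 2: summarise per inspection from `points`
# (assumption: 4 or more points deducted marks a serious violation)
toy_summary <- toy_violations %>%
  group_by(serial_number) %>%
  summarise(
    num_violations        = n(),
    total_points_deducted = sum(points, na.rm = TRUE),
    has_critical          = any(points >= 4, na.rm = TRUE),
    .groups = "drop"
  )

print(status_counts)
print(toy_summary)
```

Running count() on the real violation_status column would show whether the literal value "Critical" ever appears.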

Fuzzy Match Yelp Businesses with CA Restaurants

# Prepare Yelp data for matching
yelp_for_matching <- business_CA %>%
  filter(!is.na(name) & !is.na(address)) %>%
  mutate(
    clean_name = clean_name(name),
    clean_address = str_to_lower(str_trim(address))
  ) %>%
  select(business_id, name, address, clean_name, clean_address, 
         city, postal_code, stars, review_count)

# Prepare CA data for matching
ca_for_matching <- violation_summary %>%
  filter(!is.na(facility_name) & !is.na(facility_address)) %>%
  mutate(
    clean_name = clean_name(facility_name),
    clean_address = str_to_lower(str_trim(facility_address))
  )

cat("\nPreparing fuzzy match...\n")

## 
## Preparing fuzzy match...

cat("Yelp records for matching:", nrow(yelp_for_matching), "\n")

## Yelp records for matching: 5203

cat("CA records for matching:", nrow(ca_for_matching), "\n")

## CA records for matching: 191371

# Fuzzy join on name (distance-based)
matched_restaurants <- ca_for_matching %>%
  stringdist_left_join(
    yelp_for_matching,
    by = c("clean_name" = "clean_name"),
    max_dist = 3,  # Allow up to 3 character differences
    distance_col = "name_distance"
  ) %>%
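  # Keep only rows that found a Yelp match within the distance threshold
  # (addresses are not compared at this step)
  filter(
    is.na(name_distance) | name_distance <= 3,
    !is.na(business_id)
  ) %>%
  # Keep the closest-named Yelp match per inspection (serial_number)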
  group_by(serial_number) %>%
  arrange(name_distance) %>%
  slice(1) %>%
  ungroup()

cat("\n=== MATCHING RESULTS ===\n")

## 
## === MATCHING RESULTS ===

cat("Matched restaurants:", nrow(matched_restaurants), "\n")

## Matched restaurants: 23325

cat("Match rate:", round(nrow(matched_restaurants) / nrow(ca_for_matching) * 100, 2), "%\n")

## Match rate: 12.19 %

glimpse(matched_restaurants)

## Rows: 23,325
## Columns: 21
## $ serial_number         <chr> "DA0007NP7", "DA000LME7", "DA000NWBX", "DA000OA5…
## $ facility_name         <chr> "WHOLE FOODS MARKET #39", "WINGSTOP", "BOBA BEAR…
## $ facility_address      <chr> "18700 VENTURA BLVD", "1754 W SLAUSON AVE STE #A…
## $ inspection_date       <date> 2017-08-16, 2017-08-18, 2016-09-12, 2017-06-06,…
## $ score                 <dbl> 98, 87, 90, 93, 94, 94, 87, 96, 96, 93, 94, 96, …
## $ grade                 <chr> "A", "B", "A", "A", "A", "A", "B", "A", "A", "A"…
## $ num_violations        <int> 2, 9, 8, 5, 3, 5, 11, 4, 3, 6, 3, 4, 3, 3, 8, 1,…
## $ has_critical          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ total_points_deducted <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ clean_name.x          <chr> "whole foods market 39", "wingstop", "boba bear"…
## $ clean_address.x       <chr> "18700 ventura blvd", "1754 w slauson ave ste #a…
## $ business_id           <chr> "AMi1h-goNueHf_Lvx4Cu4g", "EBuSqlpzzyfZZkEgPxVV1…
## $ name                  <chr> "Whole Foods Market", "Wingstop", "Polar Bear", …
## $ address               <chr> "3761 State St", "3849 State St, Ste 163", "726 …
## $ clean_name.y          <chr> "whole foods market", "wingstop", "polar bear", …
## $ clean_address.y       <chr> "3761 state st", "3849 state st, ste 163", "726 …
## $ city                  <chr> "Santa Barbara", "Santa Barbara", "Santa Barbara…
## $ postal_code           <chr> "93105", "93105", "93101", "93101", "93105", "93…
## $ stars                 <dbl> 3.0, 3.0, 5.0, 4.0, 2.5, 2.5, 4.0, 3.0, 4.0, 4.0…
## $ review_count          <int> 224, 36, 6, 142, 56, 113, 9, 41, 26, 91, 189, 33…
## $ name_distance         <dbl> 3, 0, 3, 1, 0, 2, 3, 3, 2, 3, 3, 0, 0, 0, 3, 0, …
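
The matches above are made on name alone, which is how "boba bear" ends up paired with "polar bear". One possible refinement is to also require the ZIP codes to agree. The sketch below is only an illustration under stated assumptions: the toy tables and their ZIP codes are invented, and it assumes facility_zip (from the inspection data) and postal_code (from Yelp) are comparable after trimming to five digits.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy rows; the ZIP codes are invented for illustration
toy_ca <- tibble(
  serial_number = c("S1", "S2"),
  clean_name    = c("whole foods market 39", "boba bear"),
  facility_zip  = c("93105", "90047")
)
toy_yelp <- tibble(
  business_id = c("B1", "B2"),
  clean_name  = c("whole foods market", "polar bear"),
  postal_code = c("93105", "93101")
)

# Name-only fuzzy match: both rows find a partner within 3 edits
name_only <- stringdist_inner_join(
  toy_ca, toy_yelp,
  by = "clean_name",
  max_dist = 3,
  distance_col = "name_distance"
)

# Additionally require the 5-digit ZIP codes to agree
name_and_zip <- name_only %>%
  filter(substr(facility_zip, 1, 5) == substr(postal_code, 1, 5))

print(name_only)
print(name_and_zip)
```

In this toy case the ZIP filter keeps the Whole Foods pair and drops the "boba bear" / "polar bear" pair. Applying the same idea to the real data would require carrying facility_zip through violation_summary.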

Join with Yelp Reviews

if(has_review_data && nrow(matched_restaurants) > 0) {
  # Get reviews for matched businesses
  matched_reviews <- reviews %>%
    semi_join(matched_restaurants, by = "business_id") %>%
    left_join(
      matched_restaurants %>% 
        select(business_id, serial_number, facility_name, score, grade, 
               num_violations, has_critical, total_points_deducted),
      by = "business_id"
    )
  
  cat("\n=== REVIEW MATCHING RESULTS ===\n")
  cat("Matched reviews:", nrow(matched_reviews), "\n")
  cat("Unique businesses with reviews:", n_distinct(matched_reviews$business_id), "\n")
  
  glimpse(matched_reviews)
} else {
  cat("\n Cannot join reviews - either no review data or no matched restaurants\n")
  matched_reviews <- data.frame()
}

## 
## === REVIEW MATCHING RESULTS ===
## Matched reviews: 2420067 
## Unique businesses with reviews: 671 
## Rows: 2,420,067
## Columns: 16
## $ review_id             <chr> "eCiWBf1CJ0Zdv1uVarEhhw", "eCiWBf1CJ0Zdv1uVarEhh…
## $ user_id               <chr> "OhECKhQEexFypOMY6kypRw", "OhECKhQEexFypOMY6kypR…
## $ business_id           <chr> "vC2qm1y3Au5czBtbhc-DNw", "vC2qm1y3Au5czBtbhc-DN…
## $ stars                 <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ useful                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ funny                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ cool                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ text                  <chr> "Yes, this is the only sushi place in town. Howe…
## $ date                  <chr> "2013-09-04 03:48:20", "2013-09-04 03:48:20", "2…
## $ serial_number         <chr> "DA035WDAH", "DA3V1F5FG", "DA8I4EIPG", "DA9QEITK…
## $ facility_name         <chr> "SUSHI FUMI", "SUSHI & TERI", "SUSHI FIRE", "SUS…
## $ score                 <dbl> 90, 92, 94, 92, 92, 95, 90, 95, 94, 90, 90, 91, …
## $ grade                 <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"…
## $ num_violations        <int> 7, 4, 5, 7, 7, 5, 4, 2, 4, 7, 9, 7, 7, 6, 5, 6, …
## $ has_critical          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ total_points_deducted <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
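
In the output above the same review_id appears on consecutive rows with different serial_number values, which suggests each review is repeated once for every inspection matched to its business. A minimal sketch of one way to avoid this, on hypothetical toy data, is to keep a single inspection per business before joining. Choosing the most recent inspection is an assumption made for illustration, not the author's method.

```r
library(dplyr)

# Hypothetical toy data: one business matched to two inspections
toy_matched <- tibble(
  business_id     = c("B1", "B1"),
  serial_number   = c("S1", "S2"),
  inspection_date = as.Date(c("2016-01-10", "2017-06-01")),
  score           = c(90, 94)
)
toy_reviews <- tibble(
  review_id   = c("R1", "R2", "R3"),
  business_id = "B1",
  stars       = c(4, 5, 3)
)

# Joining directly repeats every review once per matched inspection
# (recent dplyr versions warn about the many-to-many match, hence suppressWarnings)
duplicated_join <- suppressWarnings(
  inner_join(toy_reviews, toy_matched, by = "business_id")
)

# Keep one inspection per business (here: the most recent) before joining
latest_inspection <- toy_matched %>%
  group_by(business_id) %>%
  slice_max(inspection_date, n = 1, with_ties = FALSE) %>%
  ungroup()

one_row_per_review <- inner_join(toy_reviews, latest_inspection, by = "business_id")

cat("Rows when joining directly:", nrow(duplicated_join), "\n")
cat("Rows after keeping one inspection per business:", nrow(one_row_per_review), "\n")
```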

Hypothesis Testing

Hypothesis 1: Inspection Scores vs Yelp Ratings

H0: There is no correlation between inspection scores and Yelp star ratings
H1: Higher inspection scores correlate with higher Yelp ratings

if(nrow(matched_restaurants) > 0) {
  # Aggregate data for correlation
  h1_data <- matched_restaurants %>%
    filter(!is.na(score) & !is.na(stars) & score > 0) %>%
    select(facility_name, score, stars, num_violations, grade)
  
  # Correlation test
  cor_test <- cor.test(h1_data$score, h1_data$stars, method = "pearson")
  
  cat("\n=== HYPOTHESIS 1 RESULTS ===\n")
  cat("Sample size:", nrow(h1_data), "\n")
  cat("Correlation coefficient:", round(cor_test$estimate, 4), "\n")
  cat("P-value:", format.pval(cor_test$p.value, digits = 4), "\n")
  cat("95% Confidence Interval: [", round(cor_test$conf.int[1], 4), ", ", 
      round(cor_test$conf.int[2], 4), "]\n", sep = "")
  
  if(cor_test$p.value < 0.05) {
    cat("REJECT H0: Significant correlation exists\n")
  } else {
    cat(" FAIL TO REJECT H0: No significant correlation\n")
  }
  
  # Visualization
  ggplot(h1_data, aes(x = score, y = stars)) +
    geom_point(alpha = 0.4, color = "steelblue") +
    geom_smooth(method = "lm", color = "darkred", se = TRUE) +
    labs(
      title = "Inspection Score vs Yelp Star Rating",
      subtitle = paste0("Correlation: ", round(cor_test$estimate, 3), 
                        " (p = ", format.pval(cor_test$p.value, digits = 3), ")"),
      x = "Health Inspection Score (Higher = Better)",
      y = "Yelp Star Rating"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5))
}

## 
## === HYPOTHESIS 1 RESULTS ===
## Sample size: 23325 
## Correlation coefficient: -0.1237 
## P-value: < 2.2e-16 
## 95% Confidence Interval: [-0.1363, -0.111]
## REJECT H0: Significant correlation exists

Interpretation: The correlation is statistically significant but weak and negative (r = -0.124, 95% CI [-0.136, -0.111]), which is the opposite direction from H1: matched inspection records with better health scores tend to have slightly lower Yelp ratings. Inspection score accounts for only about 1.5% of the variance in star rating (r² ≈ 0.015), and with 23,325 rows even a very small effect reaches significance. The scatter plot shows ratings heavily clustered at whole-number values (3.0, 4.0, 5.0 stars) with minimal variation across inspection scores.
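
H1 is stated as a directional hypothesis, while cor.test() is two-sided by default. The sketch below, on hypothetical toy values, shows how the two versions can disagree when the observed correlation runs against the stated direction. It is an illustration of the `alternative` argument, not a re-analysis of the project data.

```r
# Hypothetical toy values, chosen so that stars fall as scores rise
score <- c(90, 92, 94, 96, 98, 100)
stars <- c(4.5, 4.0, 4.0, 3.5, 3.0, 3.0)

# Two-sided test: detects a correlation in either direction
two_sided <- cor.test(score, stars, method = "pearson")

# One-sided test matching H1 as written (a positive correlation)
one_sided <- cor.test(score, stars, method = "pearson", alternative = "greater")

cat("Estimate:", round(two_sided$estimate, 3), "\n")
cat("Two-sided p-value:", format.pval(two_sided$p.value, digits = 3), "\n")
cat("One-sided (greater) p-value:", format.pval(one_sided$p.value, digits = 3), "\n")
```

With a negative estimate, the two-sided test can reject H0 while the one-sided test for a positive correlation does not, so the direction of the result should be reported alongside the decision.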

Hypothesis 2: Critical Violations vs Review Sentiment

H0: Restaurants with critical violations have the same average Yelp rating as those without
H1: Restaurants with critical violations have lower average Yelp ratings

if(nrow(matched_reviews) > 0 && "has_critical" %in% names(matched_reviews)) {
  # Calculate average rating by critical violation status
  h2_data <- matched_reviews %>%
    group_by(business_id, has_critical) %>%
    summarise(
      avg_stars = mean(stars, na.rm = TRUE),
      facility_name = first(facility_name),
      .groups = "drop"
    ) %>%
    filter(!is.na(has_critical) & !is.na(avg_stars))
  
  # Two-sample t-test with sufficient sample check
  critical_yes <- h2_data %>% filter(has_critical == TRUE) %>% pull(avg_stars)
  critical_no <- h2_data %>% filter(has_critical == FALSE) %>% pull(avg_stars)
  
  cat("\n=== HYPOTHESIS 2 RESULTS ===\n")
  cat("Sample with critical violations:", length(critical_yes), "\n")
  cat("Sample without critical violations:", length(critical_no), "\n")
  
  # Check if we have enough data for t-test (need at least 2 observations in each group)
  if(length(critical_yes) >= 2 && length(critical_no) >= 2) {
    cat("Mean rating (with critical violations):", round(mean(critical_yes), 3), "\n")
    cat("Mean rating (without critical violations):", round(mean(critical_no), 3), "\n")
    cat("Difference:", round(mean(critical_yes) - mean(critical_no), 3), "\n")
    
    t_test <- t.test(critical_yes, critical_no, alternative = "less")
    
    cat("T-statistic:", round(t_test$statistic, 4), "\n")
    cat("P-value:", format.pval(t_test$p.value, digits = 4), "\n")
    
    if(t_test$p.value < 0.05) {
      cat("REJECT H0: Critical violations associated with lower ratings\n")
    } else {
      cat(" FAIL TO REJECT H0: No significant difference\n")
    }
    
    # Visualization
    ggplot(h2_data, aes(x = has_critical, y = avg_stars, fill = has_critical)) +
      geom_boxplot(alpha = 0.7) +
      geom_jitter(width = 0.2, alpha = 0.3) +
      scale_fill_manual(values = c("TRUE" = "#E74C3C", "FALSE" = "#2ECC71")) +
      labs(
        title = "Yelp Ratings: Critical Violations vs None",
        subtitle = paste0("p-value: ", format.pval(t_test$p.value, digits = 3)),
        x = "Has Critical Violations",
        y = "Average Yelp Star Rating",
        fill = "Critical Violations"
      ) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"),
            plot.subtitle = element_text(hjust = 0.5))
  } else {
    cat("\n INSUFFICIENT DATA for t-test\n")
    cat("T-test requires at least 2 observations in each group.\n")
    
    # Show descriptive statistics instead
    if(length(critical_yes) > 0) {
      cat("Mean rating (with critical violations):", round(mean(critical_yes), 3), "\n")
    }
    if(length(critical_no) > 0) {
      cat("Mean rating (without critical violations):", round(mean(critical_no), 3), "\n")
    }
    
    # Simple visualization if we have any data
    if(nrow(h2_data) > 0) {
      # Default points have no fill, so map the group to color instead
      ggplot(h2_data, aes(x = has_critical, y = avg_stars, color = has_critical)) +
        geom_point(alpha = 0.5, size = 3) +
        scale_color_manual(values = c("TRUE" = "#E74C3C", "FALSE" = "#2ECC71")) +
        labs(
          title = "Yelp Ratings: Critical Violations vs None",
          subtitle = "Insufficient data for statistical test",
          x = "Has Critical Violations",
          y = "Average Yelp Star Rating",
          color = "Critical Violations"
        ) +
        theme_minimal() +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"),
              plot.subtitle = element_text(hjust = 0.5))
    }
  }
} else {
  cat("\n Cannot perform Hypothesis 2 test - missing required data\n")
}

## 
## === HYPOTHESIS 2 RESULTS ===
## Sample with critical violations: 0 
## Sample without critical violations: 671 
## 
##  INSUFFICIENT DATA for t-test
## T-test requires at least 2 observations in each group.
## Mean rating (without critical violations): 3.684

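Interpretation: The test could not be run. None of the 671 matched businesses was flagged with a critical violation, so there is no comparison group and H2 remains untested. Among the restaurants without a flagged critical violation, the mean Yelp rating is 3.684 and ratings span the full range from 1 to 5 stars. Because has_critical is FALSE for every business, the flag itself is worth checking: it is TRUE only where violation_status equals "Critical" exactly.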

Hypothesis 3: Number of Violations vs Review Count

H0: The number of violations does not affect review volume
H1: Restaurants with more violations receive more reviews (negative publicity effect)

if(nrow(matched_restaurants) > 0) {
  h3_data <- matched_restaurants %>%
    filter(!is.na(num_violations) & !is.na(review_count)) %>%
    select(facility_name, num_violations, review_count, stars)
  
  # Correlation test
  cor_test_h3 <- cor.test(h3_data$num_violations, h3_data$review_count, 
                          method = "spearman")
  
  cat("\n=== HYPOTHESIS 3 RESULTS ===\n")
  cat("Sample size:", nrow(h3_data), "\n")
  cat("Spearman correlation:", round(cor_test_h3$estimate, 4), "\n")
  cat("P-value:", format.pval(cor_test_h3$p.value, digits = 4), "\n")
  
  if(cor_test_h3$p.value < 0.05) {
    if(cor_test_h3$estimate > 0) {
      cat("REJECT H0: More violations associated with more reviews\n")
    } else {
      cat("REJECT H0: More violations associated with fewer reviews\n")
    }
  } else {
    cat(" FAIL TO REJECT H0: No significant relationship\n")
  }
  
  # Visualization
  ggplot(h3_data, aes(x = num_violations, y = review_count)) +
    geom_point(alpha = 0.4, color = "steelblue") +
    geom_smooth(method = "lm", color = "darkred", se = TRUE) +
    scale_y_log10() +
    labs(
      title = "Violations vs Review Volume",
      subtitle = paste0("Spearman rho: ", round(cor_test_h3$estimate, 3),
                        " (p = ", format.pval(cor_test_h3$p.value, digits = 3), ")"),
      x = "Number of Violations",
      y = "Review Count (log scale)"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5))
}

## 
## === HYPOTHESIS 3 RESULTS ===
## Sample size: 23325 
## Spearman correlation: 0.0221 
## P-value: 0.0007216 
## REJECT H0: More violations associated with more reviews

Interpretation: The Spearman correlation between violation count and review volume is positive and statistically significant (rho = 0.022, p < 0.001), which matches the direction of H1, but it is far too small to be practically meaningful. Restaurants with more reviews tend to have slightly more violations, though the effect is negligible. The plot shows most restaurants cluster in the 0-10 violation range regardless of review count.

Yelp Rating Distribution by Grade

if(nrow(matched_restaurants) > 0) {
  grade_rating_data <- matched_restaurants %>%
    filter(!is.na(grade) & !is.na(stars) & grade %in% c("A", "B", "C"))
  
  if(nrow(grade_rating_data) > 0) {
    ggplot(grade_rating_data, aes(x = stars, fill = grade)) +
      geom_density(alpha = 0.6) +
      scale_fill_manual(values = c("A" = "#2ECC71", "B" = "#F39C12", "C" = "#E74C3C")) +
      labs(
        title = "Yelp Star Rating Distribution by Inspection Grade",
        x = "Yelp Star Rating",
        y = "Density",
        fill = "Grade"
      ) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  }
}

Interpretation: Grade A restaurants show a sharp peak around 3-star ratings, Grade B restaurants have a broader distribution centered around 3.5-4 stars, while Grade C restaurants show the widest spread with substantial density across 3-5 stars. This suggests inspection grades and customer ratings measure different aspects of restaurant quality.
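
Because the star ratings in this data move in half-star steps, a smoothed density can suggest values between those steps. As a complement, the share of each star value within each grade can be tabulated directly. This is a minimal sketch on hypothetical toy rows (toy_grades is invented for illustration and stands in for grade_rating_data).

```r
library(dplyr)

# Hypothetical toy rows standing in for grade_rating_data
toy_grades <- tibble(
  grade = c("A", "A", "A", "B", "B"),
  stars = c(3, 3, 4, 3.5, 4)
)

# Share of each star value within each grade
star_share <- toy_grades %>%
  count(grade, stars) %>%
  group_by(grade) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(star_share)
```

The resulting table could be plotted with geom_col() to compare grades without smoothing across the discrete rating values.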

Heatmap: Violations vs Ratings

if(nrow(matched_restaurants) > 0) {
  heatmap_data <- matched_restaurants %>%
    filter(!is.na(stars) & !is.na(num_violations)) %>%
    mutate(
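      # Bins are right-closed, so exactly 10 violations falls in "6-10"
      violation_category = cut(num_violations, 
                               breaks = c(0, 2, 5, 10, Inf),
                               labels = c("0-2", "3-5", "6-10", "11+"),
                               include.lowest = TRUE),
      # Yelp stars come in half-star steps, so name each bin by the values it holds
      rating_category = cut(stars,
                           breaks = c(0, 2, 3, 4, 5),
                           labels = c("1-2 stars", "2.5-3 stars", "3.5-4 stars", "4.5-5 stars"),
                           include.lowest = TRUE)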
    ) %>%
    count(violation_category, rating_category)
  
  if(nrow(heatmap_data) > 0) {
    ggplot(heatmap_data, aes(x = violation_category, y = rating_category, fill = n)) +
      geom_tile(color = "white") +
      geom_text(aes(label = n), color = "white", fontface = "bold") +
      scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
      labs(
        title = "Restaurant Distribution: Violations vs Yelp Ratings",
        x = "Number of Violations",
        y = "Yelp Star Rating",
        fill = "Count"
      ) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  }
}

Interpretation: The vast majority of matched inspection records fall into the 3-5 violation range with 3-4 star ratings (shown by the dark red cells). Very few have more than 10 violations, and those that do still maintain respectable 3-5 star ratings. Low-rated restaurants (1-2 stars) are relatively rare across all violation categories.
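
The bins in the heatmap depend on how cut() treats values that sit exactly on a break. By default its intervals are right-closed, and include.lowest = TRUE additionally closes the first interval on the left. The sketch below uses illustrative values and labels to show where boundary cases such as 10 violations or 2.5 stars land.

```r
# Illustrative violation counts, including values that sit on a break
violations <- c(0, 2, 3, 10, 11)
violation_bin <- cut(violations,
                     breaks = c(0, 2, 5, 10, Inf),
                     labels = c("0-2", "3-5", "6-10", "11+"),
                     include.lowest = TRUE)

# Illustrative half-star ratings with the same breaks as the heatmap code
stars <- c(1, 2, 2.5, 3, 3.5, 4, 4.5, 5)
star_bin <- cut(stars, breaks = c(0, 2, 3, 4, 5), include.lowest = TRUE)

print(data.frame(violations, violation_bin))
print(data.frame(stars, star_bin))
```

Here 10 violations falls in the (5, 10] bin rather than the open-ended one, and 2.5 stars shares a bin with 3 stars.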

Overall Conclusion:

This analysis suggests that California’s health inspection scores and Yelp ratings capture different dimensions of restaurant quality. Inspection scores reflect regulatory compliance and food safety, while Yelp ratings are shaped more by customer experience, taste, service, and expectations. The statistical relationships between the two are consistently weak or negligible:

Restaurants with higher inspection scores have slightly lower Yelp ratings (r = -0.124), but the effect is minimal.

The effect of critical violations could not be tested, because no matched restaurant was flagged with one.

Review volume has almost no meaningful connection to violation counts (Spearman rho = 0.022).

Inspection grades and Yelp ratings display distinct distribution patterns, underscoring that they measure separate aspects of performance.

Taken together, the findings suggest that public perception and regulatory outcomes operate in parallel rather than in sync. In this sample, Yelp ratings show little sign of rewarding or penalizing restaurants for their health inspection results, and Yelp reviews cannot be used as a proxy for food safety. These estimates should be read with caution: restaurants were matched on name alone, so some pairs link different locations or different businesses (for example, "boba bear" was paired with "polar bear").

For policymakers and hospitality managers, this highlights the importance of treating inspection data and consumer sentiment as complementary but independent tools. Health departments safeguard public safety, while Yelp reviews reflect customer satisfaction. Both perspectives are valuable, but neither alone provides a complete picture of restaurant quality.

Save Processed Data for Presentation

# Save all the data objects needed for visualizations
save(
  h1_data, 
  h2_data, 
  h3_data, 
  grade_rating_data, 
  heatmap_data,
  matched_restaurants,
  file = "analysis_data.RData"
)

cat("Data saved successfully for presentation deck!\n")

## Data saved successfully for presentation deck!

Data607 - Final Project

Paula Brown

2025-12-16