Geographic Data Quality in James Beard Awards Dataset
Author
Generated with assistance from Claude AI
Published
March 27, 2025
Note
This report was generated using artificial intelligence (Claude from Anthropic) under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.
Introduction
The James Beard Awards are prestigious honors presented for excellence in cuisine, culinary writing, and culinary education in the United States. This analysis examines data quality issues in the geographical information present in the December 31, 2024 TidyTuesday dataset focusing on these awards.
Loading and Initial Examination
First, let’s load the required packages and data:
library(tidyverse)
Warning: package 'purrr' was built under R version 4.3.3
library(tidytuesdayR)
Warning: package 'tidytuesdayR' was built under R version 4.3.3
# Get the data
tuesdata <- tidytuesdayR::tt_load('2024-12-31')
restaurant_data <- tuesdata$restaurant_and_chef
The restaurant and chef awards dataset contains 10,024 entries with the following structure:
glimpse(restaurant_data)
Rows: 10,024
Columns: 6
$ subcategory <chr> "Best Chef Mid-Atlantic", "Best Chef Mid-Atlantic", "Best …
$ rank <chr> "Semifinalist", "Semifinalist", "Nominee", "Semifinalist",…
$ year <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ name <chr> "Mona Tongdee", "Yuan Tang", "Matthew Kern", "Daniel Kleno…
$ restaurant <chr> "Pusadee’s Garden", "Rooster & Owl", "One Coastal", "Bellf…
$ city <chr> "Pittsburgh, Pennsylvania", "Washington, District of Colum…
Data Quality Investigation
Missing Location Values
Let’s examine the extent of missing values in the city field:
# Count total NAs in city field
na_count <- sum(is.na(restaurant_data$city))
total_rows <- nrow(restaurant_data)
cat(sprintf("Number of NA values in city field: %d (%.1f%%)\n",
            na_count, 100 * na_count / total_rows))
Number of NA values in city field: 3471 (34.6%)
Pattern Analysis
Let’s examine how these NA values are distributed across different award subcategories:
restaurant_data %>%
  filter(is.na(city)) %>%
  count(subcategory, sort = TRUE) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  head(10) %>%
  knitr::kable(
    col.names = c("Subcategory", "Count of NAs", "% of Total NAs"),
    digits = 1
  )
| Subcategory | Count of NAs | % of Total NAs |
|:--|--:|--:|
| Best New Restaurant | 762 | 22.0 |
| Outstanding Restaurant | 635 | 18.3 |
| Outstanding Service | 479 | 13.8 |
| Outstanding Bar Program | 277 | 8.0 |
| Outstanding Wine Service | 277 | 8.0 |
| Who’s Who of Food & Beverage in America | 275 | 7.9 |
| Outstanding Wine Program | 204 | 5.9 |
| America’s Classics | 161 | 4.6 |
| Outstanding Hospitality | 85 | 2.4 |
| Outstanding Wine & Other Beverages Program | 84 | 2.4 |
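The counts above show where the missing values accumulate, but not how heavily each subcategory is affected relative to its size. A minimal sketch of a per-subcategory missing rate follows. It runs on a small invented data frame (`toy_awards` is a hypothetical stand-in, not the real data), so the numbers it prints say nothing about the actual dataset.

```r
library(dplyr)

# Toy stand-in for restaurant_data; these rows are invented for illustration.
toy_awards <- data.frame(
  subcategory = c("Best New Restaurant", "Best New Restaurant",
                  "Best Chef Mid-Atlantic", "Best Chef Mid-Atlantic"),
  city = c(NA, NA, "Pittsburgh, Pennsylvania", NA)
)

# Share of rows with a missing city within each subcategory
na_rate_by_subcategory <- toy_awards %>%
  group_by(subcategory) %>%
  summarize(
    n_rows = n(),
    n_missing = sum(is.na(city)),
    pct_missing = 100 * mean(is.na(city)),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_missing))

print(na_rate_by_subcategory)
```

Replacing `toy_awards` with `restaurant_data` should give the same summary for the real data, assuming the column names shown in the `glimpse()` output.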
Source of the Problem
In many rows where city is NA, the location information appears in the restaurant field instead. This is most common for entries that focus on establishments rather than individual chefs.
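One way to surface such rows is to filter for a missing city combined with a restaurant value that looks like a "City, State" string. The sketch below treats the presence of a comma followed by a space as that signal. This heuristic is an assumption made for illustration, and `toy_awards` is an invented data frame rather than the real dataset.

```r
library(dplyr)
library(stringr)

# Toy stand-in for restaurant_data; these rows are invented for illustration.
toy_awards <- data.frame(
  subcategory = c("Outstanding Restaurant", "Best Chef Mid-Atlantic",
                  "Outstanding Service"),
  name = c("Example Bistro", "Example Chef", "Example Cafe"),
  restaurant = c("Chicago, Illinois", "Example Kitchen", NA),
  city = c(NA, "Pittsburgh, Pennsylvania", NA)
)

# Rows where city is missing but restaurant resembles "City, State".
# Assumption: a ", " inside restaurant signals a misplaced location.
misplaced_location <- toy_awards %>%
  filter(is.na(city), str_detect(restaurant, ", ")) %>%
  select(subcategory, name, restaurant, city)

print(misplaced_location)
```

Rows where restaurant is itself NA drop out, because `filter()` discards rows whose condition evaluates to NA.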
Data Cleaning Solution
Methodology
To address this issue, we:
Use the restaurant field as a fallback when city is NA
Split the location information into city and state components
Standardize state names (e.g., “District of Columbia” to “DC”)
Here’s the cleaning process:
# Create a cleaned version of the geography data
restaurant_geo_clean <- restaurant_data %>%
  # When city is NA, try to get location from restaurant field
  mutate(location = if_else(is.na(city), restaurant, city)) %>%
  # Separate into city and state
  separate(location, into = c("city_clean", "state_clean"), sep = ", ", remove = FALSE) %>%
  # Clean up "District of Columbia" to be consistent
  mutate(state_clean = str_replace(state_clean, "District of Columbia", "DC"))
Let’s examine how many states we can identify after cleaning:
# Count distinct states in cleaned data
states_identified <- restaurant_geo_clean %>%
  filter(!is.na(state_clean)) %>%
  summarize(
    n_states = n_distinct(state_clean),
    n_cities = n_distinct(city_clean)
  )

cat(sprintf("After cleaning:\n- Distinct states identified: %d\n- Distinct cities identified: %d\n",
            states_identified$n_states, states_identified$n_cities))
After cleaning:
- Distinct states identified: 63
- Distinct cities identified: 687
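A count of 63 distinct values is more than the 50 states plus DC, which suggests that some entries in state_clean are not state names. A rough check is to compare the column against R's built-in `state.name` vector. The sketch below uses an invented data frame (`toy_clean`) and assumes full state names plus the "DC" abbreviation. Territories or other legitimate values would also be flagged, so the result is a list to review rather than a list of errors.

```r
library(dplyr)

# Toy stand-in for restaurant_geo_clean; these values are invented.
toy_clean <- data.frame(
  state_clean = c("Pennsylvania", "DC", "New York", "Suite 200", NA, "Illinois")
)

# state.name ships with base R (datasets package) and holds the 50 US states.
valid_states <- c(state.name, "DC")

# Non-missing values that are not recognized as a state or DC
unrecognized_states <- toy_clean %>%
  filter(!is.na(state_clean), !state_clean %in% valid_states) %>%
  count(state_clean, sort = TRUE)

print(unrecognized_states)
```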
Here’s the distribution of awards across states after cleaning:
state_counts_clean <- restaurant_geo_clean %>%
  count(state_clean, sort = TRUE) %>%
  filter(!is.na(state_clean)) %>%
  head(10)

ggplot(state_counts_clean, aes(x = reorder(state_clean, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 States by Number of James Beard Awards/Nominations",
    subtitle = "After cleaning and combining location data",
    x = "State",
    y = "Number of Awards/Nominations"
  ) +
  theme_minimal()
Remaining Challenges
Some data quality issues remain:
There are still some NA values in the cleaned state field:
remaining_nas <- sum(is.na(restaurant_geo_clean$state_clean))
cat(sprintf("Remaining NA values: %d (%.1f%%)\n",
            remaining_nas, 100 * remaining_nas / nrow(restaurant_geo_clean)))
Remaining NA values: 143 (1.4%)
Some location strings contain additional information that causes parsing errors, as evidenced by warning messages during the separation process.
There may be inconsistencies in city names that could benefit from further standardization.
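For location strings with more than two comma-separated pieces, one possible alternative is to split on the last comma only, on the assumption that the state is always the final piece. The sketch below illustrates that idea on invented strings. It is not the method used in this report, and whether the real data contains strings of this shape is not confirmed here.

```r
library(dplyr)
library(stringr)

# Invented location strings for illustration only.
toy_locations <- data.frame(
  location = c("Pittsburgh, Pennsylvania",
               "Example Hotel, New York, New York",
               "Chicago")
)

# Assumption: the state is whatever follows the last comma.
split_on_last_comma <- toy_locations %>%
  mutate(
    has_comma = str_detect(location, ","),
    # Text after the last comma, or NA when there is no comma
    state_clean = if_else(has_comma,
                          str_trim(str_extract(location, "[^,]+$")),
                          NA_character_),
    # Everything before the last comma, or the whole string when there is no comma
    city_clean = if_else(has_comma,
                         str_trim(str_remove(location, ",[^,]*$")),
                         location)
  )

print(split_on_last_comma)
```

With this approach any extra leading text stays in city_clean, so the city column would still need review even where the state is recovered.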
Let’s look at some examples of records where we still have missing location data:
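A minimal sketch of how such records could be pulled out and roughly classified is shown here. The data frame `toy_clean` and the `missing_reason` labels are hypothetical, introduced only for illustration. The column names follow the cleaning step above.

```r
library(dplyr)

# Toy stand-in for restaurant_geo_clean; these rows are invented.
toy_clean <- data.frame(
  subcategory = c("Outstanding Restaurant", "Best Chef Mid-Atlantic",
                  "Outstanding Service"),
  restaurant = c(NA, "Example Kitchen", "Example Cafe"),
  city = c(NA, "Pittsburgh, Pennsylvania", NA),
  location = c(NA, "Pittsburgh, Pennsylvania", "Example Cafe"),
  state_clean = c(NA, "Pennsylvania", NA)
)

# Rows still lacking a state, with a rough reason for the gap
still_missing <- toy_clean %>%
  filter(is.na(state_clean)) %>%
  mutate(missing_reason = case_when(
    is.na(location) ~ "no location text",
    TRUE ~ "no comma-separated state"
  )) %>%
  select(subcategory, restaurant, city, location, missing_reason)

print(still_missing)
```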
Conclusions
A significant portion of geographic information was stored inconsistently across fields, with location data sometimes appearing in the restaurant field rather than the city field.
The data quality issues were not random but followed patterns based on award categories, particularly affecting establishment-focused awards rather than individual chef awards.
A relatively simple cleaning process substantially improved the completeness of geographic information: 34.6% of rows had no city value before cleaning, while only 1.4% lack a cleaned state value afterward.
Further cleaning could be beneficial, particularly for:
Standardizing city names
Handling complex location strings
Investigating remaining NA values to determine if location information exists in other fields or formats
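As a starting point for standardizing city names, one option is to look for spellings that differ only in case or whitespace. The sketch below groups an invented set of city names by a normalized key and lists keys with more than one spelling. The normalization rule (lowercase plus collapsed whitespace) is an assumption and would miss abbreviations or misspellings.

```r
library(dplyr)
library(stringr)

# Invented city names for illustration only.
toy_cities <- data.frame(
  city_clean = c("New York", "New  York", "new york", "Chicago", "Chicago ")
)

# Group spellings that match after lowercasing and collapsing whitespace
city_variants <- toy_cities %>%
  mutate(city_key = str_to_lower(str_squish(city_clean))) %>%
  group_by(city_key) %>%
  summarize(
    n_spellings = n_distinct(city_clean),
    spellings = paste(unique(city_clean), collapse = " | "),
    .groups = "drop"
  ) %>%
  filter(n_spellings > 1)

print(city_variants)
```

Each flagged key would still need a manual decision about which spelling to keep.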