Geographic Data Quality in James Beard Awards Dataset

Author

Generated with assistance from Claude AI

Published

March 27, 2025

Note

This report was generated using artificial intelligence (Claude from Anthropic) under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.

Introduction

The James Beard Awards are prestigious honors presented for excellence in cuisine, culinary writing, and culinary education in the United States. This analysis examines data quality issues in the geographic information of the TidyTuesday dataset for December 31, 2024, which covers these awards.

Loading and Initial Examination

First, let’s load the required packages and data:

library(tidyverse)

library(tidytuesdayR)

# Get the data
tuesdata <- tidytuesdayR::tt_load('2024-12-31')
restaurant_data <- tuesdata$restaurant_and_chef

The restaurant and chef awards dataset contains 10,024 rows and 6 columns, with the following structure:

glimpse(restaurant_data)

Rows: 10,024
Columns: 6
$ subcategory <chr> "Best Chef Mid-Atlantic", "Best Chef Mid-Atlantic", "Best …
$ rank        <chr> "Semifinalist", "Semifinalist", "Nominee", "Semifinalist",…
$ year        <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ name        <chr> "Mona Tongdee", "Yuan Tang", "Matthew Kern", "Daniel Kleno…
$ restaurant  <chr> "Pusadee’s Garden", "Rooster & Owl", "One Coastal", "Bellf…
$ city        <chr> "Pittsburgh, Pennsylvania", "Washington, District of Colum…

Data Quality Investigation

Missing Location Values

Let’s examine the extent of missing values in the city field:

# Count total NAs in city field
na_count <- sum(is.na(restaurant_data$city))
total_rows <- nrow(restaurant_data)

cat(sprintf("Number of NA values in city field: %d (%.1f%%)\n", 
    na_count, 100 * na_count/total_rows))

Number of NA values in city field: 3471 (34.6%)
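The same check can be extended to every column at once, which helps show whether city is unusual or whether other fields are similarly sparse. The sketch below is only an illustration: the `toy` tibble is a hypothetical stand-in for `restaurant_data`, and its values are not taken from the dataset.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for restaurant_data (illustrative values only)
toy <- tibble(
  name       = c("Chef A", NA, "Chef C", NA),
  restaurant = c("Place A", "Denver, Colorado", "Place C", NA),
  city       = c("Pittsburgh, Pennsylvania", NA, "Las Vegas", NA)
)

# Count NA values per column, then reshape to one row per column
na_summary <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  mutate(pct_missing = 100 * n_missing / nrow(toy))

print(na_summary)
```

Replacing `toy` with `restaurant_data` should give the per-column picture for the real data.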

Pattern Analysis

Let’s examine how these NA values are distributed across different award subcategories:

restaurant_data %>%
  filter(is.na(city)) %>%
  count(subcategory, sort = TRUE) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  head(10) %>%
  knitr::kable(
    col.names = c("Subcategory", "Count of NAs", "% of Total NAs"),
    digits = 1
  )

Subcategory	Count of NAs	% of Total NAs
Best New Restaurant	762	22.0
Outstanding Restaurant	635	18.3
Outstanding Service	479	13.8
Outstanding Bar Program	277	8.0
Outstanding Wine Service	277	8.0
Who’s Who of Food & Beverage in America	275	7.9
Outstanding Wine Program	204	5.9
America’s Classics	161	4.6
Outstanding Hospitality	85	2.4
Outstanding Wine & Other Beverages Program	84	2.4
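The table above reports each subcategory's share of all NA values, which partly reflects how large each subcategory is. A complementary view is the NA rate within each subcategory. The following is a minimal sketch on a hypothetical `toy` tibble (the rows are made up for illustration), not a result from the dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical rows standing in for restaurant_data
toy <- tibble(
  subcategory = c("Best New Restaurant", "Best New Restaurant",
                  "Best Chef: Southwest", "Best Chef: Southwest"),
  city        = c(NA, NA, "Las Vegas, Nevada", NA)
)

# Share of rows with a missing city inside each subcategory
na_rate_by_subcategory <- toy %>%
  group_by(subcategory) %>%
  summarise(
    n_rows = n(),
    n_na   = sum(is.na(city)),
    pct_na = 100 * mean(is.na(city)),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_na))

print(na_rate_by_subcategory)
```

A subcategory with a rate near 100% would suggest a systematic recording difference rather than occasional gaps.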

Source of the Problem

The following example rows have NA in the city field but carry location information in the restaurant field:

restaurant_data %>%
  filter(is.na(city) & !is.na(restaurant)) %>%
  select(subcategory, restaurant) %>%
  head(10) %>%
  knitr::kable()

subcategory	restaurant
Outstanding Bar	Brooklyn, New York
Outstanding Wine & Other Beverages Program	Washington D.C., District of Columbia
Outstanding Bakery	New York, New York
Outstanding Bakery	Albuquerque, New Mexico
Outstanding Restaurant	Marfa, Texas
Outstanding Hospitality	Dallas, Texas
Outstanding Wine & Other Beverages Program	Chicago, Illinois
Best New Restaurant	Alna, Maine
Outstanding Restaurant	Warren, Rhode Island
Best New Restaurant	Denver, Colorado

For many entries, especially those for establishments rather than individual chefs, the location was recorded in the restaurant field instead of the city field.
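One way to check how often the restaurant field really holds a location is to test whether it ends in a state name. The sketch below is a heuristic under stated assumptions: it uses `state.name` from base R's datasets package plus "District of Columbia", the `toy` tibble is hypothetical, and the column name `restaurant_is_location` is introduced here only for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Pattern matching strings that end in ", <state name>"
state_pattern <- paste0(
  ", (",
  paste(c(state.name, "District of Columbia"), collapse = "|"),
  ")$"
)

# Hypothetical rows standing in for restaurant_data
toy <- tibble(
  restaurant = c("Brooklyn, New York", "Rooster & Owl", "Marfa, Texas", NA),
  city       = c(NA, "Washington, District of Columbia", NA, NA)
)

# Flag rows where city is missing and restaurant looks like "City, State"
flagged <- toy %>%
  mutate(
    restaurant_is_location =
      is.na(city) & !is.na(restaurant) & str_detect(restaurant, state_pattern)
  )

print(flagged)
```

A restaurant whose actual name ends in a state name would be flagged incorrectly, so the result is a screening aid rather than a definitive classification.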

Data Cleaning Solution

Methodology

To address this issue, we:

- Use the restaurant field as a fallback when city is NA
- Split the location string into city and state components
- Shorten “District of Columbia” to “DC” (other state names are left as written)

Here’s the cleaning process:

# Create a cleaned version of the geography data
restaurant_geo_clean <- restaurant_data %>%
  # When city is NA, try to get location from restaurant field
  mutate(location = if_else(is.na(city), restaurant, city)) %>%
  # Separate into city and state
  separate(location, into = c("city_clean", "state_clean"), 
           sep = ", ", remove = FALSE) %>%
  # Clean up "District of Columbia" to be consistent
  mutate(state_clean = str_replace(state_clean, 
                                  "District of Columbia", "DC"))

Warning: Expected 2 pieces. Additional pieces discarded in 10 rows [4890, 6258, 6985,
7711, 7927, 7989, 8583, 9024, 9262, 9493].

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [738, 895, 1194,
1239, 1394, 4516, 4634, 5438, 5468, 5486, 5490, 5493, 5539, 5640, 5644, 5650,
5661, 5669, 5670, 5687, ...].
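The warnings arise because some location strings have more or fewer than two comma-separated pieces. As an alternative sketch, the split can be done on the last comma only, so that extra leading pieces stay with the city. This assumes the state is always the final piece, which has not been verified against the affected rows. The helper `split_location()` and the `toy` rows (including "Suite 1, Austin, Texas") are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Split "City, State" on the LAST comma; strings without a comma keep
# the whole value as the city and get NA for the state.
split_location <- function(location) {
  has_comma <- str_detect(location, ",")
  tibble(
    city_clean  = if_else(has_comma,
                          str_trim(str_replace(location, ",[^,]*$", "")),
                          location),
    state_clean = if_else(has_comma,
                          str_trim(str_extract(location, "[^,]*$")),
                          NA_character_)
  )
}

# Hypothetical rows standing in for restaurant_data
toy <- tibble(
  restaurant = c("Pusadee's Garden", "Marfa, Texas", "Harlo", "Some Place"),
  city       = c("Pittsburgh, Pennsylvania", NA, "Las Vegas",
                 "Suite 1, Austin, Texas")
)

# coalesce() picks city when present, otherwise falls back to restaurant
toy_located <- toy %>% mutate(location = coalesce(city, restaurant))
toy_clean   <- bind_cols(toy_located, split_location(toy_located$location))

print(toy_clean)
```

Unlike `separate()` with its default settings, this keeps the last piece as the state instead of discarding it, and it raises no warnings, so rows that fail to parse have to be found by inspecting `state_clean` for NA.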

Results

Let’s examine how many states we can identify after cleaning:

# Count distinct states in cleaned data
states_identified <- restaurant_geo_clean %>%
  filter(!is.na(state_clean)) %>%
  summarize(
    n_states = n_distinct(state_clean),
    n_cities = n_distinct(city_clean)
  )

cat(sprintf("After cleaning:\n- Distinct states identified: %d\n- Distinct cities identified: %d\n",
    states_identified$n_states, states_identified$n_cities))

After cleaning:
- Distinct states identified: 63
- Distinct cities identified: 687
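Since there are only 50 states plus DC, it may be worth listing which parsed values fall outside that set. The sketch below compares `state_clean` against `state.name` from base R; the `toy_clean` values ("Tex." and "Puerto Rico" in particular) are hypothetical examples of what such a check might surface, not values confirmed in the dataset.

```r
library(dplyr)
library(tibble)

# Reference list: the 50 state names shipped with base R, plus "DC"
valid_states <- c(state.name, "DC")

# Hypothetical parsed values standing in for restaurant_geo_clean
toy_clean <- tibble(
  state_clean = c("Pennsylvania", "DC", "Texas", "Tex.", NA, "Puerto Rico")
)

# Values that are present but not in the reference list
unexpected_states <- toy_clean %>%
  filter(!is.na(state_clean), !state_clean %in% valid_states) %>%
  count(state_clean, sort = TRUE)

print(unexpected_states)
```

Note also that `n_distinct(city_clean)` counts city names alone, so cities that share a name across states are counted once.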

Here’s the distribution of awards across states after cleaning:

state_counts_clean <- restaurant_geo_clean %>%
  count(state_clean, sort = TRUE) %>%
  filter(!is.na(state_clean)) %>%
  head(10)

ggplot(state_counts_clean, aes(x = reorder(state_clean, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 States by Number of James Beard Awards/Nominations",
       subtitle = "After cleaning and combining location data",
       x = "State",
       y = "Number of Awards/Nominations") +
  theme_minimal()

Remaining Challenges

Some data quality issues remain:

There are still some NA values in the cleaned state field:

remaining_nas <- sum(is.na(restaurant_geo_clean$state_clean))
cat(sprintf("Remaining NA values: %d (%.1f%%)\n", 
    remaining_nas, 100 * remaining_nas/nrow(restaurant_geo_clean)))

Remaining NA values: 143 (1.4%)

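Some location strings do not split cleanly into two pieces: separate() discarded extra pieces in 10 rows and filled a missing state with NA in 54 rows, as the warnings above show.
The 63 distinct values in state_clean exceed the 50 states plus DC, so some parsed values are not standard state names.
City names may also be inconsistent and could benefit from further standardization.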

Let’s look at some examples of records where we still have missing location data:

restaurant_geo_clean %>%
  filter(is.na(state_clean)) %>%
  select(subcategory, name, restaurant, location) %>%
  head(10) %>%
  knitr::kable()

subcategory	name	restaurant	location
Best Chef: Southwest	Gina Marinelli	Harlo	Las Vegas
Lifetime Achievement	Madhur Jaffrey	NA	NA
Humanitarian of the Year	Karen Washington	NA	NA
Best Chef: Southwest	Gina Marinelli	Harlo	Las Vegas
Humanitarian of the Year	Grace Young	NA	NA
Humanitarian of the Year	Grace Young	NA	NA
Best Chefs	Angel Barreto	Anju	Washington
Best Chefs	Angel Barreto	Anju	Washington
Humanitarian of the Year	Grace Young	NA	NA
Best Chefs	Angel Barreto	Anju	Washington
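The sample above suggests at least two different causes: rows with no location at all, and rows with a city but no state. A rough way to size each group is sketched below. The `toy_remaining` tibble reuses a few names from the table, but the issue labels are introduced here for illustration. The repeated rows in the table may correspond to different years, which the select() call drops, so `distinct()` is used only to simplify the summary.

```r
library(dplyr)
library(stringr)
library(tibble)

# Small stand-in for the rows of restaurant_geo_clean with NA state_clean
toy_remaining <- tibble(
  name        = c("Gina Marinelli", "Madhur Jaffrey",
                  "Gina Marinelli", "Angel Barreto"),
  location    = c("Las Vegas", NA, "Las Vegas", "Washington"),
  state_clean = NA_character_
)

# Collapse repeats, then label why the state is missing
remaining_summary <- toy_remaining %>%
  filter(is.na(state_clean)) %>%
  distinct(name, location) %>%
  mutate(issue = case_when(
    is.na(location)            ~ "no location recorded",
    !str_detect(location, ",") ~ "city without state",
    TRUE                       ~ "other"
  )) %>%
  count(issue)

print(remaining_summary)
```

Rows labeled "city without state" could potentially be repaired with a city lookup, while rows with no location recorded would need an external source.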

Conclusions

Geographic information was stored inconsistently across fields: 3,471 rows (34.6%) had no value in the city field, and in many of them the location appeared in the restaurant field instead.
The missing values were not random but followed award categories, concentrating in establishment-focused awards such as Best New Restaurant and Outstanding Restaurant rather than individual chef awards.
A relatively simple cleaning process reduced the rows without a parsed state to 143 (1.4%), substantially improving the completeness of geographic information in the dataset.
Further cleaning could be beneficial, particularly for:
- Standardizing city names
- Handling complex location strings
- Investigating remaining NA values to determine if location information exists in other fields or formats