Geographic Data Quality in James Beard Awards Dataset

Author

Generated with assistance from Claude AI

Published

March 27, 2025

Note

This report was generated using artificial intelligence (Claude from Anthropic) under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.

Introduction

The James Beard Awards are prestigious honors presented for excellence in cuisine, culinary writing, and culinary education in the United States. This analysis examines data quality issues in the geographic information of the December 31, 2024 TidyTuesday dataset on these awards.

Loading and Initial Examination

First, let’s load the required packages and data:

library(tidyverse)
library(tidytuesdayR)
# Get the data
tuesdata <- tidytuesdayR::tt_load('2024-12-31')
restaurant_data <- tuesdata$restaurant_and_chef

The restaurant and chef awards dataset contains 10,024 entries with the following structure:

glimpse(restaurant_data)
Rows: 10,024
Columns: 6
$ subcategory <chr> "Best Chef Mid-Atlantic", "Best Chef Mid-Atlantic", "Best …
$ rank        <chr> "Semifinalist", "Semifinalist", "Nominee", "Semifinalist",…
$ year        <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ name        <chr> "Mona Tongdee", "Yuan Tang", "Matthew Kern", "Daniel Kleno…
$ restaurant  <chr> "Pusadee’s Garden", "Rooster & Owl", "One Coastal", "Bellf…
$ city        <chr> "Pittsburgh, Pennsylvania", "Washington, District of Colum…

Data Quality Investigation

Missing Location Values

Let’s examine the extent of missing values in the city field:

# Count total NAs in city field
na_count <- sum(is.na(restaurant_data$city))
total_rows <- nrow(restaurant_data)

cat(sprintf("Number of NA values in city field: %d (%.1f%%)\n", 
    na_count, 100 * na_count/total_rows))
Number of NA values in city field: 3471 (34.6%)
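The city field is only one of six columns, so it can be useful to run the same count for every column at once. The following is a minimal sketch on a small made-up tibble (`toy_data` and its values are hypothetical, for illustration only); on the real data one would substitute `restaurant_data`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for restaurant_data (values are made up)
toy_data <- tibble(
  name       = c("Chef A", "Chef B", "Chef C", NA),
  restaurant = c("Restaurant A", "Restaurant B", NA, "Marfa, Texas"),
  city       = c("Pittsburgh, Pennsylvania", NA, NA, NA)
)

# Count NAs in every column, then reshape to one row per column
na_summary <- toy_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  mutate(pct_missing = 100 * n_missing / nrow(toy_data))

print(na_summary)
```

A per-column view like this makes it easier to spot whether missingness in one field coincides with unusual values in another.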

Pattern Analysis

Let’s examine how these NA values are distributed across different award subcategories:

restaurant_data %>%
  filter(is.na(city)) %>%
  count(subcategory, sort = TRUE) %>%
  mutate(percentage = n/sum(n) * 100) %>%
  head(10) %>%
  knitr::kable(
    col.names = c("Subcategory", "Count of NAs", "% of Total NAs"),
    digits = 1
  )
| Subcategory | Count of NAs | % of Total NAs |
|:--------------------------------------------|-----:|-----:|
| Best New Restaurant | 762 | 22.0 |
| Outstanding Restaurant | 635 | 18.3 |
| Outstanding Service | 479 | 13.8 |
| Outstanding Bar Program | 277 | 8.0 |
| Outstanding Wine Service | 277 | 8.0 |
| Who’s Who of Food & Beverage in America | 275 | 7.9 |
| Outstanding Wine Program | 204 | 5.9 |
| America’s Classics | 161 | 4.6 |
| Outstanding Hospitality | 85 | 2.4 |
| Outstanding Wine & Other Beverages Program | 84 | 2.4 |

Source of the Problem

Here are some example rows where city is NA but the restaurant field holds location information:

restaurant_data %>%
  filter(is.na(city) & !is.na(restaurant)) %>%
  select(subcategory, restaurant) %>%
  head(10) %>%
  knitr::kable()
| subcategory | restaurant |
|:--------------------------------------------|:--------------------------------------|
| Outstanding Bar | Brooklyn, New York |
| Outstanding Wine & Other Beverages Program | Washington D.C., District of Columbia |
| Outstanding Bakery | New York, New York |
| Outstanding Bakery | Albuquerque, New Mexico |
| Outstanding Restaurant | Marfa, Texas |
| Outstanding Hospitality | Dallas, Texas |
| Outstanding Wine & Other Beverages Program | Chicago, Illinois |
| Best New Restaurant | Alna, Maine |
| Outstanding Restaurant | Warren, Rhode Island |
| Best New Restaurant | Denver, Colorado |

For many entries, especially those for establishments rather than individual chefs, the location was recorded in the restaurant field instead of the city field.
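Using the restaurant field as a fallback assumes that, whenever city is NA, the restaurant value really is a location. One way to check that assumption is to test whether the text after the last comma is a known state name. This is a sketch only: `toy_data`, `known_states`, `trailing`, and `looks_like_location` are hypothetical names, and the toy rows are made up apart from the two location strings taken from the examples above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows; "Restaurant B" stands in for a genuine restaurant name
toy_data <- tibble(
  restaurant = c("Marfa, Texas", "Warren, Rhode Island", "Restaurant B", NA),
  city       = NA_character_
)

# Base R's state.name covers the 50 states; DC has to be added by hand
known_states <- c(state.name, "District of Columbia")

toy_checked <- toy_data %>%
  mutate(
    # Text after the last ", " (NA when the string has no comma)
    trailing = str_extract(restaurant, "(?<=, )[^,]+$"),
    looks_like_location = !is.na(trailing) & trailing %in% known_states
  )

print(toy_checked)
```

Rows where `looks_like_location` is FALSE but restaurant is not NA would be the ones worth inspecting by hand before trusting the fallback.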

Data Cleaning Solution

Methodology

To address this issue, we:

  1. Use the restaurant field as a fallback when city is NA
  2. Split the location information into city and state components
  3. Standardize “District of Columbia” to “DC” so that it is recorded consistently

Here’s the cleaning process:

# Create a cleaned version of the geography data
restaurant_geo_clean <- restaurant_data %>%
  # When city is NA, try to get location from restaurant field
  mutate(location = if_else(is.na(city), restaurant, city)) %>%
  # Separate into city and state
  separate(location, into = c("city_clean", "state_clean"), 
           sep = ", ", remove = FALSE) %>%
  # Clean up "District of Columbia" to be consistent
  mutate(state_clean = str_replace(state_clean, 
                                  "District of Columbia", "DC"))
Warning: Expected 2 pieces. Additional pieces discarded in 10 rows [4890, 6258, 6985,
7711, 7927, 7989, 8583, 9024, 9262, 9493].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [738, 895, 1194,
1239, 1394, 4516, 4634, 5438, 5468, 5486, 5490, 5493, 5539, 5640, 5644, 5650,
5661, 5669, 5670, 5687, ...].
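The two warnings above correspond to strings with more than one comma separator (extra pieces dropped) and strings with none (state filled with NA). A minimal sketch for labelling each string before splitting it is shown below; `toy_locations`, `n_pieces`, and `parse_status` are hypothetical names, and the three-part string is a made-up placeholder rather than a value from the dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical location strings covering each case
toy_locations <- tibble(
  location = c("Denver, Colorado", "Las Vegas", "Town, County, State", NA)
)

toy_pieces <- toy_locations %>%
  mutate(
    # Number of pieces a split on ", " would produce
    n_pieces = str_count(location, fixed(", ")) + 1,
    parse_status = case_when(
      is.na(location) ~ "missing",
      n_pieces == 2   ~ "ok",
      n_pieces < 2    ~ "no state (filled with NA)",
      TRUE            ~ "extra pieces (discarded)"
    )
  )

print(count(toy_pieces, parse_status))
```

If the extra pieces should be kept rather than dropped, `separate()` also accepts `extra = "merge"`; which behaviour is appropriate depends on what the ten affected strings actually contain.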

Results

Let’s examine how many states we can identify after cleaning:

# Count distinct states in cleaned data
states_identified <- restaurant_geo_clean %>%
  filter(!is.na(state_clean)) %>%
  summarize(
    n_states = n_distinct(state_clean),
    n_cities = n_distinct(city_clean)
  )

cat(sprintf("After cleaning:\n- Distinct states identified: %d\n- Distinct cities identified: %d\n",
    states_identified$n_states, states_identified$n_cities))
After cleaning:
- Distinct states identified: 63
- Distinct cities identified: 687
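A count of 63 is more than the 50 states plus DC, so `state_clean` presumably still contains values other than clean state names. Territories, variant spellings, and misparsed strings are all possibilities; the output above does not say which. One way to find out is to list the values that do not match a reference set. The sketch below uses made-up values (`toy_states`, `valid_states`, and `unexpected` are hypothetical names).

```r
library(dplyr)

# Hypothetical parsed state values, including two deliberately untidy ones
toy_states <- tibble(
  state_clean = c("Texas", "DC", "New York", "Texas ", "D.C.", NA)
)

# Reference set: the 50 state names from base R plus the "DC" abbreviation
valid_states <- c(state.name, "DC")

unexpected <- toy_states %>%
  filter(!is.na(state_clean), !state_clean %in% valid_states) %>%
  count(state_clean, sort = TRUE)

print(unexpected)
```

On the real data, the same filter applied to `restaurant_geo_clean` would show which values account for the excess over 51.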

Here’s the distribution of awards across states after cleaning:

state_counts_clean <- restaurant_geo_clean %>%
  count(state_clean, sort = TRUE) %>%
  filter(!is.na(state_clean)) %>%
  head(10)

ggplot(state_counts_clean, aes(x = reorder(state_clean, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 States by Number of James Beard Awards/Nominations",
       subtitle = "After cleaning and combining location data",
       x = "State",
       y = "Number of Awards/Nominations") +
  theme_minimal()

Remaining Challenges

Some data quality issues remain:

  1. There are still some NA values in the cleaned state field:
remaining_nas <- sum(is.na(restaurant_geo_clean$state_clean))
cat(sprintf("Remaining NA values: %d (%.1f%%)\n", 
    remaining_nas, 100 * remaining_nas/nrow(restaurant_geo_clean)))
Remaining NA values: 143 (1.4%)
  2. Some location strings do not fit the two-part “City, State” pattern: the separation step discarded extra pieces in 10 rows and found no state component in 54 rows, as its warning messages show.

  3. There may be inconsistencies in city names that could benefit from further standardization.

Let’s look at some examples of records where we still have missing location data:

restaurant_geo_clean %>%
  filter(is.na(state_clean)) %>%
  select(subcategory, name, restaurant, location) %>%
  head(10) %>%
  knitr::kable()
| subcategory | name | restaurant | location |
|:-------------------------|:-----------------|:------|:-----------|
| Best Chef: Southwest | Gina Marinelli | Harlo | Las Vegas |
| Lifetime Achievement | Madhur Jaffrey | NA | NA |
| Humanitarian of the Year | Karen Washington | NA | NA |
| Best Chef: Southwest | Gina Marinelli | Harlo | Las Vegas |
| Humanitarian of the Year | Grace Young | NA | NA |
| Humanitarian of the Year | Grace Young | NA | NA |
| Best Chefs | Angel Barreto | Anju | Washington |
| Best Chefs | Angel Barreto | Anju | Washington |
| Humanitarian of the Year | Grace Young | NA | NA |
| Best Chefs | Angel Barreto | Anju | Washington |
Conclusions

  1. About a third of rows (3,471 of 10,024, or 34.6%) had no value in the city field, and for many of them the location was stored in the restaurant field instead.

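  2. The data quality issues were not random but followed patterns based on award categories, particularly affecting establishment-focused awards rather than individual chef awards. Best New Restaurant, Outstanding Restaurant, and Outstanding Service together account for about 54% of the missing city values.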

  3. A relatively simple cleaning process substantially improved the completeness of geographic information: 3,471 rows (34.6%) lacked a city value before cleaning, while 143 rows (1.4%) lack a parsed state afterwards.

  4. Further cleaning could be beneficial, particularly for:

    • Standardizing city names
    • Handling complex location strings
    • Investigating remaining NA values to determine if location information exists in other fields or formats
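As an illustration of the first point, the examples in this report already show the same city written two ways (“Washington” and “Washington D.C.”). A minimal sketch of one standardization pass follows; `toy_cities` and `city_std` are hypothetical names, the whitespace variants are made up, and the regular expression covers only this one known variant rather than being a general solution.

```r
library(dplyr)
library(stringr)

# Hypothetical city values with spelling and whitespace variants
toy_cities <- tibble(
  city_clean = c("Washington D.C.", "Washington", " New York ", "New  York")
)

toy_std <- toy_cities %>%
  mutate(
    city_std = city_clean %>%
      str_squish() %>%                                        # trim and collapse whitespace
      str_replace("^Washington,? D\\.?C\\.?$", "Washington")  # fold the D.C. variant
  )

print(distinct(toy_std, city_std))
```

Comparing `n_distinct()` before and after such a pass gives a quick measure of how many apparent cities were only spelling variants.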