Assignment 10B – Codebase

Author

Muhammad Suffyan Khan

Published

April 19, 2026

Objective

The objective of this assignment is to use the public Nobel Prize API data provided by NobelPrize.org in JSON format, retrieve the data in R, transform the nested JSON into tidy data frames, and answer four interesting data-driven questions based on the Nobel Prize dataset.

This assignment will demonstrate the ability to work with JSON data from an API, flatten and tidy nested structures, and perform exploratory analysis using tidyverse tools in R. In addition to answering straightforward summary questions, at least one of the questions will go beyond simple counts by requiring filtering, comparison, and field-level analysis across laureate and prize information.

The final result will be a single reproducible Quarto document that includes the four questions, all code used to retrieve and process the JSON data, and the resulting answers in the form of tables, summaries, and visualizations.

Selected Data Source

For this assignment, I will use the official Nobel Prize API made available through the Nobel Prize Developer Zone.

The two main API endpoints that will be used where appropriate are:

https://api.nobelprize.org/2.1/nobelPrizes
https://api.nobelprize.org/2.1/laureates

These endpoints return Nobel Prize data in JSON format, including information about prize categories, award years, laureates, birth information, affiliations, and award motivations.

API Documentation: https://www.nobelprize.org/about/developer-zone-2/

Since this assignment requires JSON processing in R, the analysis will directly retrieve the API responses in JSON format rather than relying on manually downloaded files. This supports reproducibility because the data retrieval and transformation steps will be included directly in the Quarto document.

Planned Questions

The following four questions will guide the analysis:

Question 1

Which Nobel Prize categories have been awarded most frequently?

This question will provide a category-level summary of Nobel Prize awards and serve as an introductory overview of the dataset.

Question 2

Which decades have had the highest number of Nobel Prize awards or laureates?

This question will examine how Nobel Prize activity has varied over time by grouping awards into decades.

Question 3

Which birth countries have produced the highest number of Nobel laureates?

This question will use laureate-level information to identify which countries are most frequently represented as places of birth among Nobel laureates.

Question 4

Which countries appear to lose the most Nobel laureates, meaning laureates born in one country but affiliated with or awarded through another country?

This question goes beyond simple counts and will require comparing multiple country-related fields from the laureate data. It is intended to satisfy the assignment requirement that at least one question involve more than a basic frequency count by using filtering and cross-field comparison.

Planned Workflow

The workflow for this assignment will be:

- Load required libraries such as httr2 or jsonlite, along with tidyverse packages including dplyr, tidyr, purrr, stringr, and ggplot2
- Retrieve JSON data from one or both Nobel Prize API endpoints
- Parse the JSON responses into R lists
- Inspect the JSON structure to identify the main nested fields relevant to prizes, laureates, years, categories, countries, and affiliations
- Extract and flatten the nested components into tidy tibbles
- Standardize column names and retain only the variables needed for analysis
- Convert year fields to numeric values where needed and derive additional variables such as decade
- Clean country-related fields so they can be grouped and compared consistently
- Create separate tidy data frames if necessary for:
  - prize-level data
  - laureate-level data
  - affiliation or award-related country data
- Join or compare data frames where needed to answer the more analytical question(s)
- Produce tables, summaries, and visualizations for each of the four questions
- Include short interpretations of each result directly in the Quarto report

Planned Data Preparation

Because the Nobel Prize API returns nested JSON, one of the main preparation tasks will be converting hierarchical API output into tidy rectangular data frames.

Several data preparation steps are expected:

- Nested laureate and prize information may need to be unnested into separate rows
- Country information may appear in different fields and may require extraction from nested objects
- Some laureates or prizes may have missing metadata, such as absent birth locations, affiliations, or organization-related fields
- Category and year fields may need to be standardized for consistent grouping and plotting
- In some cases, institutions and individuals may appear differently in the raw data, so only relevant fields will be retained depending on the question being answered

To keep the analysis tidy and interpretable, only variables directly relevant to the four questions will be preserved in the final analytical tables.
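As a minimal sketch of the nested-to-rectangular step described above, the example below tidies a small hand-written payload with tidyr's `hoist()` and `unnest_longer()`. The payload `toy_json` and its two laureates are hypothetical and only imitate the general shape of the API output; this is one possible approach, not necessarily the one used later in this report.

```r
library(jsonlite)
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical payload that imitates nested laureate records (not real API output)
toy_json <- '[
  {"id": "1", "knownName": {"en": "Laureate A"},
   "nobelPrizes": [{"awardYear": "1903", "category": {"en": "Physics"}},
                   {"awardYear": "1911", "category": {"en": "Chemistry"}}]},
  {"id": "2", "knownName": {"en": "Laureate B"},
   "nobelPrizes": [{"awardYear": "1950", "category": {"en": "Peace"}}]}
]'

# Keep the nesting intact so each laureate is one list element
toy_list <- fromJSON(toy_json, simplifyVector = FALSE)

toy_tidy <- tibble(laureate = toy_list) %>%
  # Pull selected fields out of each laureate record
  hoist(laureate,
        laureate_id = "id",
        laureate_name = list("knownName", "en"),
        prizes = "nobelPrizes",
        .remove = FALSE) %>%
  select(-laureate) %>%
  # One row per laureate-prize combination
  unnest_longer(prizes) %>%
  hoist(prizes,
        award_year = "awardYear",
        category = list("category", "en"),
        .remove = FALSE) %>%
  select(-prizes) %>%
  mutate(award_year = as.integer(award_year))

toy_tidy
```

The result has three rows, one per laureate-prize combination, and only atomic columns, which is the rectangular shape the later summaries rely on.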

Validation and Quality Checks

To strengthen the reliability of the analysis, I will include basic validation checks during data preparation.

These checks may include:

- verifying that the API request returns data successfully
- checking the structure of the parsed JSON objects
- confirming that key columns such as year, category, and laureate identifiers are present after transformation
- inspecting missing values in important fields such as birth country or affiliation country
- checking row counts before and after unnesting to ensure the transformation behaves as expected
- reviewing distinct category names and year ranges for consistency

These checks will help ensure that the tidy data frames accurately reflect the original JSON data and that the final answers are based on valid transformations.
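The row-count and column checks listed above could be expressed as assertions, as in the sketch below. The list `toy_laureates` is a hypothetical stand-in for parsed API output, and the specific checks are illustrative choices rather than a complete validation suite.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical parsed records: two laureates with three prizes in total
toy_laureates <- list(
  list(id = "1", nobelPrizes = list(list(awardYear = "1903"),
                                    list(awardYear = "1911"))),
  list(id = "2", nobelPrizes = list(list(awardYear = "1950")))
)

# Number of rows the flattened table should have: one per nested prize
expected_rows <- sum(map_int(toy_laureates, function(x) length(x$nobelPrizes)))

# Flatten to one row per laureate-prize combination
toy_rows <- map_dfr(toy_laureates, function(x) {
  map_dfr(x$nobelPrizes, function(p) {
    tibble(laureate_id = x$id, award_year = as.integer(p$awardYear))
  })
})

required_cols <- c("laureate_id", "award_year")

# Each condition stops the render with an error if it fails
stopifnot(
  nrow(toy_rows) == expected_rows,
  all(required_cols %in% names(toy_rows)),
  !anyNA(toy_rows$award_year)
)
```

Using `stopifnot()` means a failed check halts the Quarto render instead of silently producing a table built on a faulty transformation.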

Anticipated Challenges

One expected challenge is that the Nobel Prize API data is nested and may represent people, organizations, prizes, and affiliations in slightly different ways. This means some fields may not be directly comparable without additional cleaning.

Another challenge is that country-related analysis can be more complex than simple counting because the place of birth and the affiliation or award-related country may not always be stored in exactly the same format or level of detail. Some records may also have missing or incomplete location information.

In addition, because one of the questions compares country-related fields, careful filtering and interpretation will be required to avoid overstating results when data is incomplete or ambiguous.

Expected Outcome

The expected outcome is a reproducible Quarto report that demonstrates the full workflow of retrieving Nobel Prize JSON data from an API, transforming that data into tidy data frames, and answering four clearly stated data-driven questions.

The report will show not only that JSON data can be successfully parsed and analyzed in R, but also that the Nobel Prize dataset can be used for more meaningful exploratory analysis beyond simple counts. Visualizations such as bar charts or time-based plots will be used where appropriate to support interpretation.

Overall, this assignment will demonstrate JSON handling, tidy data transformation, exploratory analysis, and clear presentation of results in a single self-contained document.

Codebase

Load Libraries

library(httr2)
library(tidyverse)
library(jsonlite)
library(knitr)

These libraries are used for API requests, JSON handling, data wrangling, tables, and visualizations.

Retrieve Complete Data from the Nobel Prize API

The Nobel Prize API is paginated, so all pages must be collected to analyze the full dataset rather than only the first 25 records.

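# Fallback operator: return y when x is NULL or has length zero
`%||%` <- function(x, y) {
  if (is.null(x) || length(x) == 0) y else x
}

# Extract the English value from a field that is either a plain string
# or a multilingual object with an `en` element; return NA otherwise
get_en_value <- function(x) {
  if (is.null(x) || length(x) == 0) return(NA_character_)
  
  if (is.character(x)) {
    return(x[[1]])
  }
  
  if (is.list(x) && !is.null(x$en)) {
    return(as.character(x$en[[1]]))
  }
  
  NA_character_
}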

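# Follow the pagination links until no "next" link remains,
# accumulating the records stored under `results_key` on each page
fetch_all_pages <- function(base_url, results_key) {
  all_results <- list()
  next_url <- base_url
  
  while (!is.null(next_url)) {
    # Perform the request and parse the body into nested R lists
    page <- request(next_url) |>
      req_perform() |>
      resp_body_json(simplifyVector = FALSE)
    
    all_results <- c(all_results, page[[results_key]])
    # A missing or empty "next" link becomes NULL, which ends the loop
    next_url <- page$links[["next"]] %||% NULL
  }
  
  all_results
}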

prizes_raw <- fetch_all_pages(
  "https://api.nobelprize.org/2.1/nobelPrizes",
  "nobelPrizes"
)

laureates_raw <- fetch_all_pages(
  "https://api.nobelprize.org/2.1/laureates",
  "laureates"
)

This step retrieves the full Nobel Prize and laureate data from the API.
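The link-following loop can be exercised without network access. The sketch below replays the same logic against a hypothetical in-memory set of pages; `fake_pages`, `follow_next_links()`, and the `items` key are illustrative names and are not part of the Nobel Prize API or of httr2.

```r
# Hypothetical in-memory "API": each page holds results and a link to the next page
fake_pages <- list(
  page1 = list(items = list("a", "b"), links = list(`next` = "page2")),
  page2 = list(items = list("c"),      links = list(`next` = "page3")),
  page3 = list(items = list("d"),      links = list())  # last page: no "next" link
)

# Same control flow as fetch_all_pages(), with the HTTP call replaced by get_page()
follow_next_links <- function(start, get_page, results_key) {
  results <- list()
  url <- start
  while (!is.null(url)) {
    page <- get_page(url)
    results <- c(results, page[[results_key]])
    url <- page$links[["next"]]  # NULL when the link is absent, which ends the loop
  }
  results
}

collected <- follow_next_links("page1", function(u) fake_pages[[u]], "items")
length(collected)
```

All four items from the three pages are collected, which is the behaviour the real loop depends on to go beyond the first page of results.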

Build Tidy Data Frames

Prize-Level Data

This table keeps one row per category and award year and is used for category and decade summaries.

prizes_tbl <- map_dfr(prizes_raw, function(prize) {
  tibble(
    award_year = as.integer(prize$awardYear %||% NA_character_),
    category = get_en_value(prize$category),
    category_full = get_en_value(prize$categoryFullName)
  )
}) %>%
  distinct() %>%
  mutate(decade = floor(award_year / 10) * 10)

glimpse(prizes_tbl)

Rows: 682
Columns: 4
$ award_year    <int> 1901, 1901, 1901, 1901, 1901, 1902, 1902, 1902, 1902, 19…
$ category      <chr> "Chemistry", "Literature", "Peace", "Physics", "Physiolo…
$ category_full <chr> "The Nobel Prize in Chemistry", "The Nobel Prize in Lite…
$ decade        <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 19…
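The decade column is derived by rounding each award year down to the nearest multiple of ten. The short sketch below shows that calculation on a few sample years and, as an optional alternative, an integer-division form that keeps the result as an integer instead of the double shown in the output above.

```r
# Sample award years chosen to include decade boundaries
award_year <- c(1901L, 1909L, 1910L, 1969L, 2025L)

# Form used in this report: the result is a double
decade_floor <- floor(award_year / 10) * 10

# Alternative sketch: integer division keeps the integer type
decade_int <- (award_year %/% 10L) * 10L

data.frame(award_year, decade_floor, decade_int)
```

Both forms give the same decade labels, so the choice only affects the column type.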

Laureate-Level Data

This table keeps one row per laureate and stores core birth-country information.

laureates_tbl <- map_dfr(laureates_raw, function(laureate) {
  tibble(
    laureate_id = as.character(laureate$id %||% NA_character_),
    laureate_name = coalesce(
      get_en_value(laureate$knownName),
      get_en_value(laureate$fullName),
      get_en_value(laureate$orgName)
    ),
    gender = laureate$gender %||% NA_character_,
    birth_country = get_en_value(laureate$birth$place$country),
    birth_country_now = get_en_value(laureate$birth$place$countryNow)
  )
})

glimpse(laureates_tbl)

Rows: 1,018
Columns: 5
$ laureate_id       <chr> "745", "102", "779", "259", "1004", "114", "982", "9…
$ laureate_name     <chr> "A. Michael Spence", "Aage N. Bohr", "Aaron Ciechano…
$ gender            <chr> "male", "male", "male", "male", "male", "male", "mal…
$ birth_country     <chr> "USA", "Denmark", "British Protectorate of Palestine…
$ birth_country_now <chr> "USA", "Denmark", "Israel", "Lithuania", NA, "Pakist…

Laureate–Prize Data

This table expands the nested nobelPrizes field so each laureate-prize combination becomes a row.

laureate_prizes_tbl <- map_dfr(laureates_raw, function(laureate) {
  laureate_id <- as.character(laureate$id %||% NA_character_)
  laureate_name <- coalesce(
    get_en_value(laureate$knownName),
    get_en_value(laureate$fullName),
    get_en_value(laureate$orgName)
  )
  birth_country <- get_en_value(laureate$birth$place$country)
  birth_country_now <- get_en_value(laureate$birth$place$countryNow)
  
  prizes <- laureate$nobelPrizes
  
  if (is.null(prizes) || length(prizes) == 0) {
    return(tibble())
  }
  
  map_dfr(prizes, function(prize) {
    tibble(
      laureate_id = laureate_id,
      laureate_name = laureate_name,
      birth_country = birth_country,
      birth_country_now = birth_country_now,
      award_year = as.integer(prize$awardYear %||% NA_character_),
      category = get_en_value(prize$category),
      motivation = get_en_value(prize$motivation)
    )
  })
}) %>%
  mutate(decade = floor(award_year / 10) * 10)

glimpse(laureate_prizes_tbl)

Rows: 1,026
Columns: 8
$ laureate_id       <chr> "745", "102", "779", "259", "1004", "114", "982", "9…
$ laureate_name     <chr> "A. Michael Spence", "Aage N. Bohr", "Aaron Ciechano…
$ birth_country     <chr> "USA", "Denmark", "British Protectorate of Palestine…
$ birth_country_now <chr> "USA", "Denmark", "Israel", "Lithuania", NA, "Pakist…
$ award_year        <int> 2001, 1975, 2004, 1982, 2021, 1979, 2019, 2019, 2009…
$ category          <chr> "Economic Sciences", "Physics", "Chemistry", "Chemis…
$ motivation        <chr> "for their analyses of markets with asymmetric infor…
$ decade            <dbl> 2000, 1970, 2000, 1980, 2020, 1970, 2010, 2010, 2000…
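The laureate–prize table has 1,026 rows for 1,018 laureates, so eight rows come from laureates who hold more than one prize. A sketch of how such laureates could be listed is shown below on a hypothetical four-row table, `toy_laureate_prizes`, which stands in for the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical laureate-prize rows: laureate "1" appears twice
toy_laureate_prizes <- tribble(
  ~laureate_id, ~award_year, ~category,
  "1",          1903L,       "Physics",
  "1",          1911L,       "Chemistry",
  "2",          1950L,       "Peace",
  "3",          1980L,       "Literature"
)

# Laureates with more than one prize
multi_prize <- toy_laureate_prizes %>%
  count(laureate_id, name = "n_prizes") %>%
  filter(n_prizes > 1)

# Rows in excess of one per laureate
extra_rows <- nrow(toy_laureate_prizes) - n_distinct(toy_laureate_prizes$laureate_id)

multi_prize
extra_rows
```

Applied to `laureate_prizes_tbl`, the same two steps would show which laureates account for the difference between the two row counts.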

Laureate–Prize–Affiliation Data

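This table extracts affiliation countries from each laureate’s prize record, producing one row per laureate, prize, and affiliation. Prizes with no recorded affiliation are kept with a missing affiliation country. The table is used for the cross-country comparison in Question 4.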

affiliations_tbl <- map_dfr(laureates_raw, function(laureate) {
  laureate_id <- as.character(laureate$id %||% NA_character_)
  laureate_name <- coalesce(
    get_en_value(laureate$knownName),
    get_en_value(laureate$fullName),
    get_en_value(laureate$orgName)
  )
  birth_country <- get_en_value(laureate$birth$place$country)
  birth_country_now <- get_en_value(laureate$birth$place$countryNow)
  
  prizes <- laureate$nobelPrizes
  
  if (is.null(prizes) || length(prizes) == 0) {
    return(tibble())
  }
  
  map_dfr(prizes, function(prize) {
    affiliations <- prize$affiliations
    
    if (is.null(affiliations) || length(affiliations) == 0) {
      return(tibble(
        laureate_id = laureate_id,
        laureate_name = laureate_name,
        birth_country = birth_country,
        birth_country_now = birth_country_now,
        award_year = as.integer(prize$awardYear %||% NA_character_),
        category = get_en_value(prize$category),
        affiliation_country = NA_character_
      ))
    }
    
    map_dfr(affiliations, function(aff) {
      tibble(
        laureate_id = laureate_id,
        laureate_name = laureate_name,
        birth_country = birth_country,
        birth_country_now = birth_country_now,
        award_year = as.integer(prize$awardYear %||% NA_character_),
        category = get_en_value(prize$category),
        affiliation_country = coalesce(
          get_en_value(aff$countryNow),
          get_en_value(aff$country)
        )
      )
    })
  })
}) %>%
  mutate(decade = floor(award_year / 10) * 10)

glimpse(affiliations_tbl)

Rows: 1,115
Columns: 8
$ laureate_id         <chr> "745", "102", "779", "259", "1004", "114", "114", …
$ laureate_name       <chr> "A. Michael Spence", "Aage N. Bohr", "Aaron Ciecha…
$ birth_country       <chr> "USA", "Denmark", "British Protectorate of Palesti…
$ birth_country_now   <chr> "USA", "Denmark", "Israel", "Lithuania", NA, "Paki…
$ award_year          <int> 2001, 1975, 2004, 1982, 2021, 1979, 1979, 2019, 20…
$ category            <chr> "Economic Sciences", "Physics", "Chemistry", "Chem…
$ affiliation_country <chr> "USA", "Denmark", "Israel", "United Kingdom", NA, …
$ decade              <dbl> 2000, 1970, 2000, 1980, 2020, 1970, 1970, 2010, 20…
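The affiliation country is taken from the present-day field when it exists and falls back to the original field otherwise. The sketch below shows that fallback rule in isolation using `dplyr::coalesce()`; the country labels are hypothetical examples, not values taken from the API.

```r
library(dplyr)

# Hypothetical values for three affiliations
country_now <- c("Germany", NA, NA)        # present-day label, sometimes missing
country     <- c("Prussia", "USSR", NA)    # label recorded at award time

# Take the first non-missing value in each position
affiliation_country <- coalesce(country_now, country)
affiliation_country
```

The third element stays missing because neither field has a value, which is why `affiliation_country` can still contain `NA` after the fallback.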

Validation Checks

These checks confirm that the full dataset was retrieved and that the most important fields were created correctly.

tibble(
  dataset = c("prizes_tbl", "laureates_tbl", "laureate_prizes_tbl", "affiliations_tbl"),
  rows = c(nrow(prizes_tbl), nrow(laureates_tbl), nrow(laureate_prizes_tbl), nrow(affiliations_tbl))
) %>%
  kable(caption = "Row counts for the main tidy tables")

Row counts for the main tidy tables
dataset	rows
prizes_tbl	682
laureates_tbl	1018
laureate_prizes_tbl	1026
affiliations_tbl	1115

tibble(
  min_award_year = min(prizes_tbl$award_year, na.rm = TRUE),
  max_award_year = max(prizes_tbl$award_year, na.rm = TRUE),
  distinct_categories = n_distinct(prizes_tbl$category),
  missing_birth_country = sum(is.na(laureates_tbl$birth_country_now)),
  missing_affiliation_country = sum(is.na(affiliations_tbl$affiliation_country))
) %>%
  kable(caption = "Basic validation checks")

Basic validation checks
min_award_year	max_award_year	distinct_categories	missing_birth_country	missing_affiliation_country
1901	2025	6	31	271

Question 1: Which Nobel Prize Categories Have Been Awarded Most Frequently?

This question provides a simple overview of how often each category has appeared in the Nobel Prize dataset.

q1_category_counts <- prizes_tbl %>%
  count(category, sort = TRUE)

q1_category_counts %>%
  kable(caption = "Nobel Prize categories ranked by number of awards")

Nobel Prize categories ranked by number of awards
category	n
Chemistry	125
Literature	125
Peace	125
Physics	125
Physiology or Medicine	125
Economic Sciences	57

Visualization

q1_category_counts %>%
  ggplot(aes(x = reorder(category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Nobel Prize Categories by Number of Awards",
    x = "Category",
    y = "Number of Awards"
  )

Brief Interpretation

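The five original categories (Chemistry, Literature, Peace, Physics, and Physiology or Medicine) each have 125 entries, one for every award year from 1901 to 2025. Economic Sciences has 57 entries because it was first awarded in 1969. Because prizes_tbl holds one row per category and year, these counts measure how long each category has existed rather than how many prizes were actually handed out, and years in which a prize was withheld are still counted.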
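Each original category has exactly 125 entries, which equals the number of years from 1901 to 2025, so the counts include years in which no prize was given. One way to separate awarded from withheld prizes is sketched below. It assumes that a prize record carries a non-empty `laureates` element only when the prize was awarded; that assumption should be checked against the raw API output before use, and `toy_prizes` is hypothetical data.

```r
library(purrr)

# Hypothetical prize records: the first has no laureates element
toy_prizes <- list(
  list(awardYear = "1939", category = list(en = "Peace")),
  list(awardYear = "1944", category = list(en = "Peace"),
       laureates = list(list(id = "x1")))
)

# Assumption: an awarded prize has at least one entry under `laureates`
was_awarded <- map_lgl(toy_prizes, function(p) length(p$laureates) > 0)

sum(was_awarded)   # number of prizes actually awarded
sum(!was_awarded)  # number of category-years without an award
```

If the assumption holds for the real data, adding such a flag to `prizes_tbl` would let the category counts reflect awarded prizes only.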

Question 2: Which Decades Have Had the Highest Number of Nobel Prize Awards?

This question examines how Nobel Prize activity changes over time by grouping awards into decades.

q2_decade_counts <- prizes_tbl %>%
  filter(!is.na(decade)) %>%
  count(decade, sort = TRUE)

q2_decade_counts %>%
  kable(caption = "Number of Nobel Prize awards by decade")

Number of Nobel Prize awards by decade
decade	n
1970	60
1980	60
1990	60
2000	60
2010	60
1960	51
1910	50
1920	50
1930	50
1940	50
1950	50
1900	45
2020	36

Visualization

q2_decade_counts %>%
  ggplot(aes(x = decade, y = n)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Nobel Prize Awards by Decade",
    x = "Decade",
    y = "Number of Awards"
  )

Brief Interpretation

The 1970s through the 2010s each have 60 entries, which is six categories multiplied by ten years. The decades from the 1910s to the 1950s have 50 because only five categories existed before Economic Sciences was added in 1969, which also explains the 51 entries in the 1960s. The 1900s have 45 because the data begin in 1901, and the 2020s have 36 because the decade is still in progress (2020 to 2025).
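The decade totals can be reproduced from the number of active categories alone. The sketch below assumes five categories per year up to 1968 and six from 1969 onward, which is when Economic Sciences was first awarded, and sums the category count per decade.

```r
# Award years covered by the dataset
years <- 1901:2025

# Categories with an entry in each year: five originally, six from 1969
n_categories <- ifelse(years >= 1969, 6, 5)

# Sum of category-years per decade
expected_by_decade <- tapply(n_categories, (years %/% 10) * 10, sum)
expected_by_decade
```

The computed values match the table above (45, 50, 51, 60, and 36 for the respective decades), which indicates that the decade counts reflect the calendar and the number of categories rather than variation in prize activity.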

Question 3: Which Birth Countries Have Produced the Highest Number of Nobel Laureates?

This question uses laureate-level birthplace information to identify which countries are most represented among Nobel laureates.

q3_birth_country_counts <- laureates_tbl %>%
  filter(!is.na(birth_country_now)) %>%
  count(birth_country_now, sort = TRUE)

q3_birth_country_counts %>%
  slice_head(n = 15) %>%
  kable(caption = "Top 15 present-day birth countries of Nobel laureates")

Top 15 present-day birth countries of Nobel laureates
birth_country_now	n
USA	296
United Kingdom	94
Germany	84
France	63
Japan	30
Sweden	30
Russia	29
Poland	28
Canada	22
Italy	20
the Netherlands	20
Austria	19
Switzerland	19
Norway	13
China	12

Visualization

q3_birth_country_counts %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(birth_country_now, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 15 Present-Day Birth Countries of Nobel Laureates",
    x = "Birth Country (Present-Day)",
    y = "Number of Laureates"
  )

Brief Interpretation

The United States stands far above all other countries with 296 laureates, more than three times the count of the United Kingdom (94), with Germany (84) in third place. Among laureates with a recorded birth country, births are heavily concentrated in a small number of present-day countries.

Question 4: Which Countries Appear to Lose the Most Nobel Laureates?

To make this question analytical rather than a simple count, this section compares each laureate’s present-day birth country with the country of award-time affiliation recorded in the prize data. Cases where the two do not match are treated as cross-country movement.

q4_country_loss <- affiliations_tbl %>%
  filter(
    !is.na(birth_country_now),
    !is.na(affiliation_country),
    birth_country_now != affiliation_country
  ) %>%
  distinct(laureate_id, award_year, category, birth_country_now, affiliation_country) %>%
  count(birth_country_now, sort = TRUE)

q4_country_loss %>%
  slice_head(n = 15) %>%
  kable(caption = "Top 15 birth countries whose laureates were awarded with affiliations in another country")

Top 15 birth countries whose laureates were awarded with affiliations in another country
birth_country_now	n
Germany	27
United Kingdom	25
Poland	20
Canada	15
France	15
Austria	12
Russia	12
the Netherlands	10
Hungary	9
Scotland	9
Italy	8
Japan	8
China	7
Australia	6
India	5

Visualization

q4_country_loss %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(birth_country_now, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Countries Losing Laureates to Another Affiliation Country",
    x = "Birth Country (Present-Day)",
    y = "Cross-Country Laureate Count"
  )

Additional Comparison Table

This table shows the most common country-to-country flows.

q4_flows <- affiliations_tbl %>%
  filter(
    !is.na(birth_country_now),
    !is.na(affiliation_country),
    birth_country_now != affiliation_country
  ) %>%
  distinct(laureate_id, award_year, category, birth_country_now, affiliation_country) %>%
  count(birth_country_now, affiliation_country, sort = TRUE)

q4_flows %>%
  slice_head(n = 15) %>%
  kable(caption = "Most common cross-country laureate flows")

Most common cross-country laureate flows
birth_country_now	affiliation_country	n
United Kingdom	USA	19
Canada	USA	15
Germany	USA	14
France	USA	9
Poland	Germany	9
Japan	USA	8
Scotland	United Kingdom	7
China	USA	6
Poland	USA	6
Russia	USA	6
Austria	Germany	5
Austria	USA	5
Hungary	USA	5
Italy	USA	5
Germany	Switzerland	4

Brief Interpretation

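Germany, the United Kingdom, and Poland appear most often in the cross-country counts, meaning many laureates born there were affiliated with institutions in another country at the time of the award. The flow table shows that the United States is the most common destination, which suggests that it has been a major attractor of Nobel-level researchers and scholars. Two caveats apply. First, Scotland is recorded separately from the United Kingdom, so the Scotland to United Kingdom flow reflects inconsistent country labels rather than movement. Second, the counts are distinct combinations of laureate, prize, and affiliation country, so a laureate affiliated in two foreign countries for the same prize is counted twice.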
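The flow table lists Scotland to United Kingdom as a cross-country flow, which points to a labelling mismatch rather than actual movement. A possible refinement is to harmonise country labels before comparing them, as sketched below. The mapping `label_map`, the helper `normalise_country()`, and the rows in `toy_aff` are hypothetical; a real mapping should be built only after inspecting the distinct labels in the data.

```r
library(dplyr)
library(tibble)

# Hypothetical birth and affiliation pairs
toy_aff <- tribble(
  ~laureate_id, ~birth_country_now, ~affiliation_country,
  "1",          "Scotland",         "United Kingdom",
  "2",          "Canada",           "USA",
  "3",          "Germany",          "USA",
  "4",          "Germany",          "Germany",
  "5",          "Scotland",         "USA"
)

# Assumed mapping from a sub-national label to its parent country
label_map <- c("Scotland" = "United Kingdom")

# Replace mapped labels and leave every other value unchanged
normalise_country <- function(x) {
  mapped <- unname(label_map[x])
  if_else(is.na(mapped), x, mapped)
}

toy_flows <- toy_aff %>%
  mutate(across(c(birth_country_now, affiliation_country), normalise_country)) %>%
  filter(birth_country_now != affiliation_country) %>%
  count(birth_country_now, affiliation_country, sort = TRUE)

toy_flows
```

After normalisation the Scotland to United Kingdom pair no longer counts as movement, while the Scotland to USA pair is reported as United Kingdom to USA.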

Conclusion

The Nobel Prize API data was successfully retrieved in full, transformed into tidy data frames, and used to answer four data-driven questions. The analysis shows category-level patterns, long-term trends across decades, country-level birthplace representation, and cross-country affiliation patterns among Nobel laureates.

Overall, this codebase demonstrates complete JSON retrieval, nested data handling, tidy transformation, and comparative analysis in a reproducible Quarto workflow.