10BData607

Author

XiaoFei

Introduction

This analysis use the Nobel Prize organization’s public API to explore structured data on laureates and prizes. The goal is to answer the following four data-driven questions:

  1. How has the share of female laureates changed across decades and categories?

  2. Which countries have “lost” the most laureates to other nations (i.e., born in one country but affiliated elsewhere at award time)?

  3. Which countries have produced the most Nobel laureates by birth?

  4. How has the number of prizes per category evolved over time?

We will first establish a connection to the Nobel Prize API endpoints (`/laureates` and `/nobelPrizes`). The raw JSON responses will be saved into tibbles, then cleaned and joined to create two primary data frames: - `laureate_prizes` and `prizes_df for analysis.

Setup API Connection

library(httr2)
library(jsonlite)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(stringr)
library(forcats)
library(scales)
library(knitr)
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
theme_set(theme_minimal(base_size = 12))

fetch_nobel <- function(endpoint, limit = 1000) {
  url <- paste0("https://api.nobelprize.org/2.1/", endpoint,
                "?limit=", limit, "&offset=0&format=json")
  resp <- request(url) |>
    req_headers(Accept = "application/json") |>
    req_perform()
  resp |> resp_body_string() |> fromJSON(flatten = TRUE)
}

# ── Laureates endpoint 
raw_laureates <- fetch_nobel("laureates", limit = 1000)

# ── Nobel Prizes endpoint 
raw_prizes <- fetch_nobel("nobelPrizes", limit = 1000)

ldf_raw <- raw_laureates$laureates


laureates <- tibble(
  id            = ldf_raw$id,
  full_name     = coalesce(ldf_raw$fullName.en, ldf_raw$orgName.en),
  gender        = ldf_raw$gender,
  birth_year    = as.integer(str_sub(ldf_raw$birth.date, 1, 4)),
  death_year    = as.integer(str_sub(ldf_raw$death.date, 1, 4)),
  birth_country = ldf_raw$birth.place.country.en,
  birth_countryNow = ldf_raw$birth.place.countryNow.en
)

prizes_nested <- ldf_raw$nobelPrizes

library(purrr)

Attaching package: 'purrr'
The following object is masked from 'package:scales':

    discard
The following object is masked from 'package:jsonlite':

    flatten
# Handles: NULL, empty list(), zero-row data frame, missing column.
safe_affil <- function(affil_list, field) {
  map_chr(affil_list, function(.x) {

    if (is.null(.x)) return(NA_character_)
    if (!is.data.frame(.x) && length(.x) == 0) return(NA_character_)
    if (is.data.frame(.x) && nrow(.x) == 0) return(NA_character_)
    if (is.data.frame(.x) && !field %in% names(.x)) return(NA_character_)
    val <- .x[[field]][[1]]
    if (is.null(val) || length(val) == 0) NA_character_ else as.character(val)
  })
}

extract_prizes <- function(i) {
  p <- prizes_nested[[i]]
  if (is.null(p) || !is.data.frame(p) || nrow(p) == 0) return(NULL)

  n_prizes <- nrow(p)

  affil_col <- if ("affiliations" %in% names(p)) p$affiliations
               else vector("list", n_prizes)
  tibble(
    id            = ldf_raw$id[[i]],
    prize_year    = as.integer(p$awardYear),
    category      = p$category.en,
    affiliation   = safe_affil(affil_col, "nameNow.en"),
    award_country = safe_affil(affil_col, "country.en")
  )
}

prizes_long <- map(seq_along(prizes_nested), extract_prizes) |>
  bind_rows()

# Join laureate info onto prizes
laureate_prizes <- laureates |>
  inner_join(prizes_long, by = "id")

glimpse(laureate_prizes)
Rows: 1,008
Columns: 11
$ id               <chr> "745", "102", "779", "259", "1004", "114", "982", "98…
$ full_name        <chr> "A. Michael Spence", "Aage Niels Bohr", "Aaron Ciecha…
$ gender           <chr> "male", "male", "male", "male", "male", "male", "male…
$ birth_year       <int> 1943, 1922, 1947, 1926, 1948, 1926, 1961, 1976, 1939,…
$ death_year       <int> NA, 2009, NA, 2018, NA, 1996, NA, NA, NA, NA, 1995, 1…
$ birth_country    <chr> "USA", "Denmark", "British Protectorate of Palestine"…
$ birth_countryNow <chr> "USA", "Denmark", "Israel", "Lithuania", NA, "Pakista…
$ prize_year       <int> 2001, 1975, 2004, 1982, 2021, 1979, 2019, 2019, 2009,…
$ category         <chr> "Economic Sciences", "Physics", "Chemistry", "Chemist…
$ affiliation      <chr> "Stanford University", "Niels Bohr Institute", "Techn…
$ award_country    <chr> "USA", "Denmark", "Israel", "United Kingdom", NA, "It…

Get prize-level data frame:

pdf_raw <- raw_prizes$nobelPrizes

prizes_df <- tibble(
  year        = as.integer(pdf_raw$awardYear),
  category    = pdf_raw$category.en,
  n_laureates = pdf_raw$numberLaureates,
  motivation  = coalesce(pdf_raw$topMotivation.en, NA_character_)
)

glimpse(prizes_df)
Rows: 682
Columns: 3
$ year       <int> 1901, 1901, 1901, 1901, 1901, 1902, 1902, 1902, 1902, 1902,…
$ category   <chr> "Chemistry", "Literature", "Peace", "Physics", "Physiology …
$ motivation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Explore questions1 : How has the share of female laureates changed across decades and categories?

To answer this, we group laureates by decade (based on prize_year) and category, then compute the percentage of female recipients. We also calculate an overall trend line across all categories to see if progress is universal or field-dependent.

gender_decade <- laureate_prizes |>
  filter(!is.na(gender)) |>
  mutate(decade = (prize_year %/% 10) * 10) |>
  group_by(decade, category) |>
  summarise(
    total   = n(),
    n_women = sum(gender == "female", na.rm = TRUE),
    pct_women = n_women / total * 100,
    .groups = "drop"
  )

# Overall by decade 
gender_overall <- laureate_prizes |>
  filter(!is.na(gender)) |>
  mutate(decade = (prize_year %/% 10) * 10) |>
  group_by(decade) |>
  summarise(
    total     = n(),
    n_women   = sum(gender == "female", na.rm = TRUE),
    pct_women = n_women / total * 100,
    .groups   = "drop"
  )

gender_overall |>
  kable(digits = 1,
        col.names = c("Decade", "Total Laureates",
                      "Women Laureates", "% Women"))
Decade Total Laureates Women Laureates % Women
1900 56 3 5.4
1910 37 1 2.7
1920 53 2 3.8
1930 55 3 5.5
1940 39 3 7.7
1950 70 0 0.0
1960 74 3 4.1
1970 102 4 3.9
1980 91 4 4.4
1990 98 6 6.1
2000 115 11 9.6
2010 116 13 11.2
2020 72 14 19.4
ggplot(gender_decade |> filter(total >= 3),
       aes(decade, pct_women, colour = category, group = category)) +
  geom_line(linewidth = 0.9) +
  geom_point(aes(size = n_women), alpha = 0.8) +
  geom_smooth(data = gender_overall,
              aes(decade, pct_women, group = 1),
              colour = "black", linewidth = 1.2,
              method = "loess", se = TRUE, inherit.aes = FALSE) +
  scale_size_continuous(range = c(1, 6), guide = "none") +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    title = "Share of Nobel Prizes Awarded to Women, by Decade",
    x = "Decade",
    y = "% of laureates who are women",
    colour = "Category",
    caption = "Source: Nobel Prize API"
  ) +
  theme(legend.position = "bottom")
`geom_smooth()` using formula = 'y ~ x'

Questions 2 Which country has “lost” the most Nobel laureates — born there but awarded while a citizen of another country?

We define “lost” laureates as individuals whose birth country differs from their award_country (country of affiliation at prize time). Organizations are excluded to focus on individual mobility. The table ranks countries by number of emigrated laureates and identifies their most common destination.

brain_drain <- laureate_prizes |>
  filter(
    !is.na(birth_country),
    !is.na(award_country),
    gender != "org",   # exclude organisations
    birth_country != award_country          
  ) |>
  mutate(
    birth_country = str_trim(birth_country),
    award_country = str_trim(award_country)
  ) |>
  group_by(birth_country) |>
  summarise(
    laureates_lost = n(),
    top_destination = names(sort(table(award_country), decreasing = TRUE))[1],
    .groups = "drop"
  ) |>
  arrange(desc(laureates_lost)) |>
  slice_head(n = 15)


brain_drain |>
  kable(col.names = c("Birth Country", "Laureates Won Elsewhere","Most Common Destination"))
Birth Country Laureates Won Elsewhere Most Common Destination
Germany 26 USA
United Kingdom 24 USA
Canada 15 USA
France 12 USA
Austria-Hungary 11 USA
Prussia 11 Germany
Russia 10 USSR
Russian Empire 10 USSR
Scotland 9 United Kingdom
the Netherlands 9 USA
Italy 8 USA
China 7 USA
Hungary 7 USA
India 7 United Kingdom
Australia 6 United Kingdom
brain_drain |>
  mutate(birth_country = fct_reorder(birth_country, laureates_lost)) |>
  ggplot(aes(laureates_lost, birth_country, fill = top_destination)) +
  geom_col() +
  geom_text(aes(label = laureates_lost), hjust = -0.2, size = 3.5) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Countries That 'Lost' the Most Nobel Laureates",
    x = "Number of laureates won elsewhere",
    y = "Birth country",
    fill = "Most common\ndestination",
    caption = "Source: Nobel Prize API"
  ) +
  theme(legend.position = "right")

Q3. Which questions have produced the most Nobel laureates?

This question ignores later migration and simply counts laureates by their birth country. Organisations are excluded. The top 15 countries are visualized to show which nations have historically generated the most Nobel‑winning talent.

top_countries <- laureate_prizes |>
  filter(!is.na(birth_country), gender != "org") |>
  count(birth_country, name = "n_laureates") |>
  arrange(desc(n_laureates)) |>
  slice_head(n = 15)

top_countries |>
  kable(col.names = c("Birth Country", "Number of Laureates"))
Birth Country Number of Laureates
USA 298
United Kingdom 94
Germany 78
France 60
Sweden 30
Japan 27
Canada 22
the Netherlands 20
Switzerland 19
Italy 18
Russia 18
Russian Empire 16
Austria 15
Austria-Hungary 13
Norway 13
#| fig-height: 5

top_countries |>
  mutate(birth_country = fct_reorder(birth_country, n_laureates)) |>
  ggplot(aes(n_laureates, birth_country, fill = n_laureates)) +
  geom_col() +
  geom_text(aes(label = n_laureates), hjust = -0.2, size = 3.5) +
  scale_fill_gradient(low = "#c6dbef", high = "#084594", guide = "none") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title   = "Top 15 Countries by Nobel Laureate Birth Country",
    x       = "Number of laureates",
    y       = NULL,
    caption = "Source: Nobel Prize API"
  )

Q4. How many prizes have been awarded per category per decades?

We count prizes (not laureates) by decade and category. A stacked bar chart makes category composition easy to compare across decades.

prizes_by_decade <- prizes_df |>
  mutate(decade = (year %/% 10) * 10) |>
  group_by(decade, category) |>
  summarise(n_prizes = n(), .groups = "drop")

prizes_by_decade |>
  pivot_wider(names_from = category,
              values_from = n_prizes,
              values_fill = 0) |>
  arrange(decade) |>
  kable(caption = "Number of prizes awarded per decade by category")
Number of prizes awarded per decade by category
decade Chemistry Literature Peace Physics Physiology or Medicine Economic Sciences
1900 9 9 9 9 9 0
1910 10 10 10 10 10 0
1920 10 10 10 10 10 0
1930 10 10 10 10 10 0
1940 10 10 10 10 10 0
1950 10 10 10 10 10 0
1960 10 10 10 10 10 1
1970 10 10 10 10 10 10
1980 10 10 10 10 10 10
1990 10 10 10 10 10 10
2000 10 10 10 10 10 10
2010 10 10 10 10 10 10
2020 6 6 6 6 6 6
prizes_by_decade |>
  ggplot(aes(decade, n_prizes, fill = category)) +
  geom_col(position = "stack") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Nobel Prizes Awarded per Decade by Category",
    x        = "Decade",
    y        = "Number of prizes",
    fill     = "Category",
    caption  = "Source: Nobel Prize API"
  )

Conclusion

This analysis uncovered several trend about Nobel Prize data:

  • Gender gap: The share of female laureates has increased slowly but unevenly across disciplines. The overall upward trend (black line) is driven largely by the Peace and Literature categories, while Physics and Chemistry remain male‑dominated.

  • Germany has “lost” the most laureates (i.e., born there but awarded elsewhere), with the United States being the most common destination.

  • The United States leads by far in laureates, followed by the United Kingdom and Germany.

  • Prize frequency: The number of prizes per decade has remained stable at about 50 per decade across most categories, except for Economic Sciences, which was introduced after 1960.

Together, these findings demonstrate how simple API data can be transformed into meaningful sociological and historical insights and showing through visualizations.