10BData607

Author

XiaoFei

Introduction

This analysis use the Nobel Prize organization’s public API to explore structured data on laureates and prizes. The goal is to answer the following four data-driven questions:

How has the share of female laureates changed across decades and categories?
Which countries have “lost” the most laureates to other nations (i.e., born in one country but affiliated elsewhere at award time)?
Which countries have produced the most Nobel laureates by birth?
How has the number of prizes per category evolved over time?

We will first establish a connection to the Nobel Prize API endpoints (`/laureates` and `/nobelPrizes`). The raw JSON responses will be saved into tibbles, then cleaned and joined to create two primary data frames: - `laureate_prizes` and `prizes_df for analysis.

Setup API Connection

library(httr2)
library(jsonlite)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(stringr)
library(forcats)
library(scales)
library(knitr)
library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

theme_set(theme_minimal(base_size = 12))

fetch_nobel <- function(endpoint, limit = 1000) {
  url <- paste0("https://api.nobelprize.org/2.1/", endpoint,
                "?limit=", limit, "&offset=0&format=json")
  resp <- request(url) |>
    req_headers(Accept = "application/json") |>
    req_perform()
  resp |> resp_body_string() |> fromJSON(flatten = TRUE)
}

# ── Laureates endpoint 
raw_laureates <- fetch_nobel("laureates", limit = 1000)

# ── Nobel Prizes endpoint 
raw_prizes <- fetch_nobel("nobelPrizes", limit = 1000)

ldf_raw <- raw_laureates$laureates


laureates <- tibble(
  id            = ldf_raw$id,
  full_name     = coalesce(ldf_raw$fullName.en, ldf_raw$orgName.en),
  gender        = ldf_raw$gender,
  birth_year    = as.integer(str_sub(ldf_raw$birth.date, 1, 4)),
  death_year    = as.integer(str_sub(ldf_raw$death.date, 1, 4)),
  birth_country = ldf_raw$birth.place.country.en,
  birth_countryNow = ldf_raw$birth.place.countryNow.en
)

prizes_nested <- ldf_raw$nobelPrizes

library(purrr)


Attaching package: 'purrr'

The following object is masked from 'package:scales':

    discard

The following object is masked from 'package:jsonlite':

    flatten

# Handles: NULL, empty list(), zero-row data frame, missing column.
safe_affil <- function(affil_list, field) {
  map_chr(affil_list, function(.x) {

    if (is.null(.x)) return(NA_character_)
    if (!is.data.frame(.x) && length(.x) == 0) return(NA_character_)
    if (is.data.frame(.x) && nrow(.x) == 0) return(NA_character_)
    if (is.data.frame(.x) && !field %in% names(.x)) return(NA_character_)
    val <- .x[[field]][[1]]
    if (is.null(val) || length(val) == 0) NA_character_ else as.character(val)
  })
}

extract_prizes <- function(i) {
  p <- prizes_nested[[i]]
  if (is.null(p) || !is.data.frame(p) || nrow(p) == 0) return(NULL)

  n_prizes <- nrow(p)

  affil_col <- if ("affiliations" %in% names(p)) p$affiliations
               else vector("list", n_prizes)
  tibble(
    id            = ldf_raw$id[[i]],
    prize_year    = as.integer(p$awardYear),
    category      = p$category.en,
    affiliation   = safe_affil(affil_col, "nameNow.en"),
    award_country = safe_affil(affil_col, "country.en")
  )
}

prizes_long <- map(seq_along(prizes_nested), extract_prizes) |>
  bind_rows()

# Join laureate info onto prizes
laureate_prizes <- laureates |>
  inner_join(prizes_long, by = "id")

glimpse(laureate_prizes)

Rows: 1,008
Columns: 11
$ id               <chr> "745", "102", "779", "259", "1004", "114", "982", "98…
$ full_name        <chr> "A. Michael Spence", "Aage Niels Bohr", "Aaron Ciecha…
$ gender           <chr> "male", "male", "male", "male", "male", "male", "male…
$ birth_year       <int> 1943, 1922, 1947, 1926, 1948, 1926, 1961, 1976, 1939,…
$ death_year       <int> NA, 2009, NA, 2018, NA, 1996, NA, NA, NA, NA, 1995, 1…
$ birth_country    <chr> "USA", "Denmark", "British Protectorate of Palestine"…
$ birth_countryNow <chr> "USA", "Denmark", "Israel", "Lithuania", NA, "Pakista…
$ prize_year       <int> 2001, 1975, 2004, 1982, 2021, 1979, 2019, 2019, 2009,…
$ category         <chr> "Economic Sciences", "Physics", "Chemistry", "Chemist…
$ affiliation      <chr> "Stanford University", "Niels Bohr Institute", "Techn…
$ award_country    <chr> "USA", "Denmark", "Israel", "United Kingdom", NA, "It…

Get prize-level data frame:

pdf_raw <- raw_prizes$nobelPrizes

prizes_df <- tibble(
  year        = as.integer(pdf_raw$awardYear),
  category    = pdf_raw$category.en,
  n_laureates = pdf_raw$numberLaureates,
  motivation  = coalesce(pdf_raw$topMotivation.en, NA_character_)
)

glimpse(prizes_df)

Rows: 682
Columns: 3
$ year       <int> 1901, 1901, 1901, 1901, 1901, 1902, 1902, 1902, 1902, 1902,…
$ category   <chr> "Chemistry", "Literature", "Peace", "Physics", "Physiology …
$ motivation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Explore questions1 : How has the share of female laureates changed across decades and categories?

To answer this, we group laureates by decade (based on prize_year) and category, then compute the percentage of female recipients. We also calculate an overall trend line across all categories to see if progress is universal or field-dependent.

gender_decade <- laureate_prizes |>
  filter(!is.na(gender)) |>
  mutate(decade = (prize_year %/% 10) * 10) |>
  group_by(decade, category) |>
  summarise(
    total   = n(),
    n_women = sum(gender == "female", na.rm = TRUE),
    pct_women = n_women / total * 100,
    .groups = "drop"
  )

# Overall by decade 
gender_overall <- laureate_prizes |>
  filter(!is.na(gender)) |>
  mutate(decade = (prize_year %/% 10) * 10) |>
  group_by(decade) |>
  summarise(
    total     = n(),
    n_women   = sum(gender == "female", na.rm = TRUE),
    pct_women = n_women / total * 100,
    .groups   = "drop"
  )

gender_overall |>
  kable(digits = 1,
        col.names = c("Decade", "Total Laureates",
                      "Women Laureates", "% Women"))

Decade	Total Laureates	Women Laureates	% Women
1900	56	3	5.4
1910	37	1	2.7
1920	53	2	3.8
1930	55	3	5.5
1940	39	3	7.7
1950	70	0	0.0
1960	74	3	4.1
1970	102	4	3.9
1980	91	4	4.4
1990	98	6	6.1
2000	115	11	9.6
2010	116	13	11.2
2020	72	14	19.4

ggplot(gender_decade |> filter(total >= 3),
       aes(decade, pct_women, colour = category, group = category)) +
  geom_line(linewidth = 0.9) +
  geom_point(aes(size = n_women), alpha = 0.8) +
  geom_smooth(data = gender_overall,
              aes(decade, pct_women, group = 1),
              colour = "black", linewidth = 1.2,
              method = "loess", se = TRUE, inherit.aes = FALSE) +
  scale_size_continuous(range = c(1, 6), guide = "none") +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    title = "Share of Nobel Prizes Awarded to Women, by Decade",
    x = "Decade",
    y = "% of laureates who are women",
    colour = "Category",
    caption = "Source: Nobel Prize API"
  ) +
  theme(legend.position = "bottom")

`geom_smooth()` using formula = 'y ~ x'

Questions 2 Which country has “lost” the most Nobel laureates — born there but awarded while a citizen of another country?

We define “lost” laureates as individuals whose birth country differs from their award_country (country of affiliation at prize time). Organizations are excluded to focus on individual mobility. The table ranks countries by number of emigrated laureates and identifies their most common destination.

brain_drain <- laureate_prizes |>
  filter(
    !is.na(birth_country),
    !is.na(award_country),
    gender != "org",   # exclude organisations
    birth_country != award_country          
  ) |>
  mutate(
    birth_country = str_trim(birth_country),
    award_country = str_trim(award_country)
  ) |>
  group_by(birth_country) |>
  summarise(
    laureates_lost = n(),
    top_destination = names(sort(table(award_country), decreasing = TRUE))[1],
    .groups = "drop"
  ) |>
  arrange(desc(laureates_lost)) |>
  slice_head(n = 15)


brain_drain |>
  kable(col.names = c("Birth Country", "Laureates Won Elsewhere","Most Common Destination"))

Birth Country	Laureates Won Elsewhere	Most Common Destination
Germany	26	USA
United Kingdom	24	USA
Canada	15	USA
France	12	USA
Austria-Hungary	11	USA
Prussia	11	Germany
Russia	10	USSR
Russian Empire	10	USSR
Scotland	9	United Kingdom
the Netherlands	9	USA
Italy	8	USA
China	7	USA
Hungary	7	USA
India	7	United Kingdom
Australia	6	United Kingdom

brain_drain |>
  mutate(birth_country = fct_reorder(birth_country, laureates_lost)) |>
  ggplot(aes(laureates_lost, birth_country, fill = top_destination)) +
  geom_col() +
  geom_text(aes(label = laureates_lost), hjust = -0.2, size = 3.5) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Countries That 'Lost' the Most Nobel Laureates",
    x = "Number of laureates won elsewhere",
    y = "Birth country",
    fill = "Most common\ndestination",
    caption = "Source: Nobel Prize API"
  ) +
  theme(legend.position = "right")

Q3. Which questions have produced the most Nobel laureates?

This question ignores later migration and simply counts laureates by their birth country. Organisations are excluded. The top 15 countries are visualized to show which nations have historically generated the most Nobel‑winning talent.

top_countries <- laureate_prizes |>
  filter(!is.na(birth_country), gender != "org") |>
  count(birth_country, name = "n_laureates") |>
  arrange(desc(n_laureates)) |>
  slice_head(n = 15)

top_countries |>
  kable(col.names = c("Birth Country", "Number of Laureates"))

Birth Country	Number of Laureates
USA	298
United Kingdom	94
Germany	78
France	60
Sweden	30
Japan	27
Canada	22
the Netherlands	20
Switzerland	19
Italy	18
Russia	18
Russian Empire	16
Austria	15
Austria-Hungary	13
Norway	13

#| fig-height: 5

top_countries |>
  mutate(birth_country = fct_reorder(birth_country, n_laureates)) |>
  ggplot(aes(n_laureates, birth_country, fill = n_laureates)) +
  geom_col() +
  geom_text(aes(label = n_laureates), hjust = -0.2, size = 3.5) +
  scale_fill_gradient(low = "#c6dbef", high = "#084594", guide = "none") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title   = "Top 15 Countries by Nobel Laureate Birth Country",
    x       = "Number of laureates",
    y       = NULL,
    caption = "Source: Nobel Prize API"
  )

Q4. How many prizes have been awarded per category per decades?

We count prizes (not laureates) by decade and category. A stacked bar chart makes category composition easy to compare across decades.

prizes_by_decade <- prizes_df |>
  mutate(decade = (year %/% 10) * 10) |>
  group_by(decade, category) |>
  summarise(n_prizes = n(), .groups = "drop")

prizes_by_decade |>
  pivot_wider(names_from = category,
              values_from = n_prizes,
              values_fill = 0) |>
  arrange(decade) |>
  kable(caption = "Number of prizes awarded per decade by category")

Number of prizes awarded per decade by category
decade	Chemistry	Literature	Peace	Physics	Physiology or Medicine	Economic Sciences
1900	9	9	9	9	9	0
1910	10	10	10	10	10	0
1920	10	10	10	10	10	0
1930	10	10	10	10	10	0
1940	10	10	10	10	10	0
1950	10	10	10	10	10	0
1960	10	10	10	10	10	1
1970	10	10	10	10	10	10
1980	10	10	10	10	10	10
1990	10	10	10	10	10	10
2000	10	10	10	10	10	10
2010	10	10	10	10	10	10
2020	6	6	6	6	6	6

prizes_by_decade |>
  ggplot(aes(decade, n_prizes, fill = category)) +
  geom_col(position = "stack") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Nobel Prizes Awarded per Decade by Category",
    x        = "Decade",
    y        = "Number of prizes",
    fill     = "Category",
    caption  = "Source: Nobel Prize API"
  )

Conclusion

This analysis uncovered several trend about Nobel Prize data:

Gender gap: The share of female laureates has increased slowly but unevenly across disciplines. The overall upward trend (black line) is driven largely by the Peace and Literature categories, while Physics and Chemistry remain male‑dominated.
Germany has “lost” the most laureates (i.e., born there but awarded elsewhere), with the United States being the most common destination.
The United States leads by far in laureates, followed by the United Kingdom and Germany.
Prize frequency: The number of prizes per decade has remained stable at about 50 per decade across most categories, except for Economic Sciences, which was introduced after 1960.

Together, these findings demonstrate how simple API data can be transformed into meaningful sociological and historical insights and showing through visualizations.