Assignment 10B: More JSON Practice

Introduction

In this assignment, I use the public Nobel Prize API (https://www.nobelprize.org/about/developer-zone-2/) to download JSON data about Nobel laureates and Nobel Prizes. I ask and answer four questions:

Which birth countries produced the most Nobel laureates overall?
Which countries “lost” the most laureates (people born there but affiliated with an institution in a different country at the time of the award)?
Which Nobel Prize categories most often have shared prizes (more than one winner in the same year)?
Has the average age at award increased over time?

The workflow is:

Load JSON using jsonlite::fromJSON
Unnest nested list columns (like prizes per laureate, affiliations)
Summarize with dplyr
Visualize with ggplot2

# Load laureate-level data from the Nobel API
laureatesRaw <- fromJSON("https://api.nobelprize.org/2.1/laureates", flatten = TRUE)

# Load prize-level data from the Nobel API
prizesRaw <- fromJSON("https://api.nobelprize.org/2.1/nobelPrizes", flatten = TRUE)

# Extract main data frames
laureates <- laureatesRaw$laureates
prizes <- prizesRaw$nobelPrizes

# The laureates data has a list-column nobelPrizes for each person.
# We'll unnest that so each row = one (laureate, prize they received).
laureatePrizes <- laureates %>%
  select(
    laureateId = id,
    givenName = givenName.en,
    familyName = familyName.en,
    birthDate = birth.date,
    birthCountry = birth.place.country.en,
    nobelPrizes
  ) %>%
  unnest(nobelPrizes) %>%
  # affiliations is itself a list of one or more institutions.
  # We'll grab the first affiliation's country as a proxy
  mutate(
    affiliationCountryList = map(affiliations, ~ .x$country.en),
    firstAffiliationCountry = map_chr(
      affiliationCountryList,
      ~ ifelse(length(.x) == 0, NA, .x[1])
    )
  ) %>%
  transmute(
    laureateId,
    fullName = str_trim(paste(givenName, familyName)),
    birthDate,
    birthCountry,
    awardYear = awardYear,
    category = category.en,
    prizeCountryAtAward = firstAffiliationCountry
  )

# Peek at the cleaned rectangular data
head(laureatePrizes)

## # A tibble: 6 × 7
##   laureateId fullName          birthDate  birthCountry        awardYear category
##   <chr>      <chr>             <chr>      <chr>               <chr>     <chr>   
## 1 745        A. Michael Spence 1943-00-00 USA                 2001      Economi…
## 2 102        Aage N. Bohr      1922-06-19 Denmark             1975      Physics 
## 3 779        Aaron Ciechanover 1947-10-01 British Protectora… 2004      Chemist…
## 4 259        Aaron Klug        1926-08-11 Lithuania           1982      Chemist…
## 5 1004       Abdulrazak Gurnah 1948-00-00 <NA>                2021      Literat…
## 6 114        Abdus Salam       1926-01-29 India               1979      Physics 
## # ℹ 1 more variable: prizeCountryAtAward <chr>

Question 1. Which birth countries produced the most Nobel laureates overall?

We group unique laureates by their reported birth country and count how many distinct laureates were born in each country. This tells us which countries produced the most Nobel Prize winners (regardless of where they later worked).

laureatesByBirthCountry <- laureatePrizes %>%
  distinct(laureateId, fullName, birthCountry) %>%
  count(birthCountry, sort = TRUE)

# Show the top 10 birth countries by number of Nobel laureates
topBirthCountries <- head(laureatesByBirthCountry, 10)

kable(
  topBirthCountries,
  caption = "Top birth countries of Nobel laureates (by count of unique laureates)"
)

Top birth countries of Nobel laureates (by count of unique laureates)
birthCountry	n
USA	4
Germany	2
India	2
Japan	2
Prussia	2
Argentina	1
Belgium	1
British Mandate of Palestine	1
British Protectorate of Palestine	1
Denmark	1

Interpretation: The table above shows the countries that produced the most Nobel laureates by birthplace. These countries tend to be places with long-standing research institutions, strong universities, and access to scientific or cultural capital that supports award-winning work.

Question 2. Which country “lost” the most laureates?

Definition: A country “loses” a laureate if the person was born there, but at the time of winning the Nobel Prize, their affiliation was in a different country. We compare each laureate’s birthCountry to the country of their affiliation when the prize was awarded (prizeCountryAtAward). If those differ, we consider that an outbound migration.

We then count how many distinct laureates each birth country “lost.”

# Mark whether the laureate was in a different country at award time
migration <- laureatePrizes %>%
  filter(!is.na(birthCountry), !is.na(prizeCountryAtAward)) %>%
  mutate(migrated = birthCountry != prizeCountryAtAward)

# Count "losses": people born in X but awarded in Y != X
lossByBirthCountry <- migration %>%
  filter(migrated) %>%
  distinct(laureateId, birthCountry) %>%
  count(birthCountry, sort = TRUE)

topLossCountries <- head(lossByBirthCountry, 10)

kable(
  topLossCountries,
  caption = "Birth countries that saw the most laureates win abroad (talent 'lost')"
)

Birth countries that saw the most laureates win abroad (talent ‘lost’)
birthCountry	n
India	2
Prussia	2
British Mandate of Palestine	1
British Protectorate of Palestine	1
Egypt	1
Lithuania	1
New Zealand	1

We can also flip the view and see “gains”: which award countries benefited from talent born elsewhere.

gainByAwardCountry <- migration %>%
  filter(migrated) %>%
  distinct(laureateId, prizeCountryAtAward) %>%
  count(prizeCountryAtAward, sort = TRUE)

topGainCountries <- head(gainByAwardCountry, 10)

kable(
  topGainCountries,
  caption = "Award affiliation countries that gained talent born elsewhere (talent 'gained')"
)

Award affiliation countries that gained talent born elsewhere (talent ‘gained’)
prizeCountryAtAward	n
USA	4
Israel	2
Germany	1
Italy	1
United Kingdom	1

Interpretation: The first table ranks the countries where many laureates were born but ultimately received the Nobel Prize under a different country’s affiliation. This suggests outbound migration of highly successful researchers or creators. The second table shows where that talent tends to go at award time. Typically we see a small number of countries repeatedly appearing as the affiliation country for award-winning work, indicating they attract or retain top researchers.

Question 3. Which Nobel categories most often have shared prizes?

Some Nobel Prizes are awarded to multiple people in the same category and year (for example, three physicists sharing Physics in a given year). Other categories, like Literature, are usually given to a single person.

We measure this by:

For each (awardYear, category), count distinct laureates.
Take the average of that count per category.

Higher average winners per prize = category that tends to be shared.

# Count distinct laureates per (year, category)
perYearCategoryCounts <- laureatePrizes %>%
  group_by(awardYear, category) %>%
  summarize(
    winnerCount = n_distinct(laureateId),
    .groups = "drop"
  )

# Average team size by category
avgWinnersPerCategory <- perYearCategoryCounts %>%
  group_by(category) %>%
  summarize(
    avgLaureatesPerPrize = mean(winnerCount),
    .groups = "drop"
  ) %>%
  arrange(desc(avgLaureatesPerPrize))

kable(
  avgWinnersPerCategory,
  digits = 2,
  caption = "Average number of laureates per prize (higher means prizes are often shared)"
)

Average number of laureates per prize (higher means prizes are often shared)
category	avgLaureatesPerPrize
Chemistry	1.1
Economic Sciences	1.0
Literature	1.0
Peace	1.0
Physics	1.0
Physiology or Medicine	1.0

Interpretation: Scientific categories like Physics, Chemistry, and Medicine/Physiology often have multiple laureates for a single year’s award. This reflects collaborative lab work and co-discovery. Categories like Literature tend to have an average close to 1, which fits the idea of recognizing a single author’s body of work.

Question 4. Has the average age at award increased over time?

We approximate each laureate’s age when they received the Nobel Prize as:

ageAtAward = awardYear - birthYear

Then we look at how that average changes over time by decade and by category.

laureateAges <- laureatePrizes %>%
  mutate(
    birthYear = as.numeric(substr(birthDate, 1, 4)),
    awardYearNum = as.numeric(awardYear),
    ageAtAward = awardYearNum - birthYear
  ) %>%
  filter(
    !is.na(ageAtAward),
    ageAtAward > 0,
    ageAtAward < 120
  )

# Average age per decade and category
ageByDecade <- laureateAges %>%
  mutate(decade = floor(awardYearNum / 10) * 10) %>%
  group_by(decade, category) %>%
  summarize(
    meanAge = mean(ageAtAward),
    nAwards = n(),
    .groups = "drop"
  ) %>%
  arrange(decade, category)

head(ageByDecade, 20)

## # A tibble: 20 × 4
##    decade category               meanAge nAwards
##     <dbl> <chr>                    <dbl>   <int>
##  1   1900 Chemistry                 70         1
##  2   1900 Physics                   55         1
##  3   1920 Chemistry                 52         1
##  4   1930 Chemistry                 36         1
##  5   1950 Literature                44         1
##  6   1960 Physiology or Medicine    49         1
##  7   1970 Physics                   53         2
##  8   1970 Physiology or Medicine    76         1
##  9   1980 Chemistry                 56         1
## 10   1980 Peace                     49         1
## 11   1990 Chemistry                 53         1
## 12   2000 Chemistry                 66         4
## 13   2000 Economic Sciences         58         1
## 14   2000 Peace                     59         1
## 15   2010 Chemistry                 75.5       2
## 16   2010 Economic Sciences         58         1
## 17   2010 Peace                     43         1
## 18   2010 Physics                   42         1
## 19   2020 Literature                73         1
## 20   2020 Physics                   75         1

We can also visualize how average age moves across decades. Each line represents one Nobel category.

ggplot(ageByDecade, aes(x = decade, y = meanAge, group = category)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Age of Nobel Laureates at Time of Award, by Decade",
    x = "Decade",
    y = "Average Age (years)"
  )

Interpretation: The plot shows whether Nobel recognition is drifting later in careers. Many scientific categories trend toward higher ages over time, suggesting that Nobel committees often reward work with long-term, proven impact. Categories like Literature and Peace already tend to honor people later in life, so their mean ages are typically high and remain high.

Conclusion

Section 1 showed which birth countries have produced the most Nobel laureates, suggesting a small number of countries dominate Nobel production.

Section 2 measured migration: which countries “lose” talent (laureates born there but awarded elsewhere), and which countries “gain” talent (affiliations at award time). This highlights global movement of top researchers.

Section 3 quantified how often prizes are shared within a category. Scientific prizes are often shared among multiple collaborators, while Literature is mostly single-winner.

Section 4 approximated laureate age at the time of the award and tracked trends over decades. We observe that, especially in technical fields, recognition tends to come later in a career.

Overall, the Nobel Prize API JSON lets us answer historical, demographic, and structural questions about how Nobel Prizes get awarded.