For this assignment I used the Nobel Prize public API, which is freely available at the Nobel Prize Developer Zone and requires no authentication. The API returns data in JSON format and covers every prize awarded since 1901. I worked with two endpoints. The first is the Nobel Prizes endpoint, which returns one record per prize per year and includes the award year, the category, and a list of laureates. The second is the Laureates endpoint, which returns detailed records for every individual or organization that has ever won, including birth country, gender, and institutional affiliation at the time of the award.
I retrieved both endpoints in R, handled pagination so that every record was captured, and then flattened the nested JSON into tidy data frames for analysis.
Setup and Data Retrieval
Code
library(tidyverse)library(httr2)library(jsonlite)library(knitr)library(kableExtra)library(scales)# Helper function to paginate through Nobel API endpointsfetch_all <-function(base_url, limit =100) { results <-list() offset <-0repeat { resp <-request(base_url) |>req_url_query(limit = limit, offset = offset) |>req_perform() body <- resp |>resp_body_json(simplifyVector =FALSE) records <- body[[1]]if (length(records) ==0) break results <-c(results, records) offset <- offset + limitif (length(records) < limit) breakSys.sleep(0.3) } results}# Retrieve raw JSON from both endpointsraw_prizes <-fetch_all("https://api.nobelprize.org/2.1/nobelPrizes")raw_laureates <-fetch_all("https://api.nobelprize.org/2.1/laureates")cat(sprintf("Total prize records retrieved: %d\nTotal laureate records retrieved: %d\n",length(raw_prizes), length(raw_laureates)))
Total prize records retrieved: 682
Total laureate records retrieved: 1018
The fetch_all function sends repeated requests, each time moving the offset forward by the page size, until the API returns fewer records than the limit which signals the final page. A short pause between requests keeps things polite. After running this I had all prizes and all laureates in memory.
Code
# Prize level data frame: one row per prizeprizes_df <-map_dfr(raw_prizes, function(p) {tibble(year =as.integer(p$awardYear),category = p$category$en %||%NA_character_,n_laureates =length(p$laureates %||%list()) )})# Laureate level data frame: one row per person per prize wonlaureates_df <-map_dfr(raw_laureates, function(l) { prize_list <- l$nobelPrizes %||%list()if (length(prize_list) ==0) return(NULL)map_dfr(prize_list, function(p) {tibble(laureate_id =as.character(l$id %||%NA),full_name = l$fullName$en %||% l$orgName$en %||%NA_character_,gender = l$gender %||%"org",birth_country = l$birth$place$country$en %||% l$birth$place$countryNow$en %||%NA_character_,prize_year =as.integer(p$awardYear %||%NA),prize_category = p$category$en %||%NA_character_,affil_country = { af <- p$affiliations %||%list()if (length(af) >0) af[[1]]$country$en %||%NA_character_elseNA_character_ } ) })})glimpse(laureates_df)
The prizes_df gives one row per prize event. The laureates_df is richer: one row per laureate per prize won, with birth country, gender, affiliated country, and category all available as columns. This structure makes it straightforward to filter, group, and join across questions.
Question 1: Which prize category has given out the most awards over time?
Question Description
I want to know whether all six Nobel categories have been equally active historically or if some have awarded prizes far more often than others. This is a straightforward count by category, but it reveals meaningful institutional differences. The Economics prize was created much later than the others, and the Peace prize has occasionally been withheld in years when no worthy recipient was identified. Those structural differences should show up clearly in the totals.
Code
Code
# Count total prizes per categorycategory_counts <- prizes_df |>filter(!is.na(category)) |>count(category, name ="total_prizes") |>arrange(desc(total_prizes))# Bar chartggplot(category_counts,aes(x = total_prizes,y =fct_reorder(category, total_prizes))) +geom_col(fill ="#2C7BB6", width =0.7) +geom_text(aes(label = total_prizes), hjust =-0.2, size =4) +labs(title ="Total Nobel Prizes Awarded by Category",subtitle ="All years from 1901 through present",x ="Number of Prizes",y =NULL ) +theme_minimal(base_size =13) +theme(panel.grid.major.y =element_blank()) +xlim(0, max(category_counts$total_prizes) *1.12)
Medicine or Physiology, Physics, and Chemistry lead the count as expected, each having run continuously since 1901. Literature and Peace follow closely. Economics sits at the bottom with a noticeably smaller total, which makes sense given it only began in 1969. The gap between the top three science categories and Economics is a direct reflection of that sixty-eight year head start rather than any difference in how frequently they award the prize each year.
Question 2: Which countries have produced the most Nobel laureates by birth?
Question Description
This question focuses on origin. Looking at birth country rather than citizenship or residency at the time of the award gives a clearer picture of where Nobel-level talent comes from in the first place. I filtered out organizations since they do not have birth countries and made sure each person was counted only once even if they won multiple prizes.
Code
Code
# One row per individual laureate with their birth countrybirth_counts <- laureates_df |>filter(gender !="org", !is.na(birth_country)) |>distinct(laureate_id, birth_country) |>count(birth_country, name ="n_laureates") |>slice_max(n_laureates, n =15) |>arrange(desc(n_laureates))# Horizontal bar chartggplot(birth_counts,aes(x = n_laureates,y =fct_reorder(birth_country, n_laureates))) +geom_col(fill ="#1A9641", width =0.7) +geom_text(aes(label = n_laureates), hjust =-0.2, size =3.8) +labs(title ="Top 15 Countries by Laureate Birth Country",subtitle ="Individual laureates only, counted once per person",x ="Number of Laureates",y =NULL ) +theme_minimal(base_size =13) +theme(panel.grid.major.y =element_blank()) +xlim(0, max(birth_counts$n_laureates) *1.12)
Code
# Tablebirth_counts |>kable(col.names =c("Birth Country", "Laureates"),caption ="Countries with the most Nobel laureates by birth") |>kable_styling(bootstrap_options =c("striped", "hover"),full_width =FALSE)
Countries with the most Nobel laureates by birth
Birth Country
Laureates
USA
296
United Kingdom
94
Germany
80
France
60
Japan
30
Sweden
30
Canada
22
the Netherlands
20
Switzerland
19
Italy
18
Russia
18
Austria
16
Russian Empire
16
Austria-Hungary
13
Norway
13
Prussia
13
Answer
The United States is far ahead of every other country even when looking only at birth. The United Kingdom and Germany follow at a considerable distance. What is worth keeping in mind is that many of the US figures were immigrants who were born elsewhere, which is exactly what Question 3 explores. A high birth count does not tell the whole story of where the intellectual work actually happened.
Question 3: Which country lost the most Nobel-level talent to other countries?
Question Description
This is the question that required joining and comparing fields across the data. My approach was to measure brain drain by comparing where each laureate was born against the country where their affiliated institution was located when they received the prize. If those two countries are different, the birth country effectively lost that person and the affiliate country gained them. I joined the birth country and affiliation country fields, calculated the net difference for each country, and filtered to countries with at least five births to keep the results reliable.
A positive net figure means a country produced more laureates than it kept. A negative figure means more laureates were affiliated there than were born there, making it a net importer of talent.
Code
Code
# Count how many laureates were born in each countryborn_counts <- laureates_df |>filter(gender !="org", !is.na(birth_country), !is.na(affil_country)) |>distinct(laureate_id, birth_country) |>count(birth_country, name ="born_in") |>rename(country = birth_country)# Count how many laureates were affiliated in each country at time of awardawarded_counts <- laureates_df |>filter(gender !="org", !is.na(affil_country)) |>distinct(laureate_id, affil_country) |>count(affil_country, name ="awarded_in") |>rename(country = affil_country)# Join and compute net talent flowbrain_drain <- born_counts |>left_join(awarded_counts, by ="country") |>mutate(awarded_in =replace_na(awarded_in, 0),net_loss = born_in - awarded_in ) |>filter(born_in >=5) |>arrange(desc(net_loss))# Plot top and bottom countriesbrain_drain |>slice(c(1:10)) |>ggplot(aes(x = net_loss,y =fct_reorder(country, net_loss),fill = net_loss >0)) +geom_col(width =0.7) +scale_fill_manual(values =c("TRUE"="#D7191C", "FALSE"="#2C7BB6"),labels =c("Net importer of talent", "Net exporter of talent"),name =NULL ) +labs(title ="Nobel Prize Brain Drain by Birth Country",subtitle ="Positive values indicate more laureates born there than awarded there",x ="Laureates born in country minus laureates affiliated in country",y =NULL ) +theme_minimal(base_size =13) +theme(panel.grid.major.y =element_blank(),legend.position ="top" )
Code
# Summary tablebrain_drain |>slice_head(n =12) |>kable(col.names =c("Country", "Born There", "Affiliated There at Award", "Net Outflow"),caption ="Countries with the greatest Nobel talent brain drain" ) |>kable_styling(bootstrap_options =c("striped", "hover"),full_width =FALSE)
Countries with the greatest Nobel talent brain drain
Country
Born There
Affiliated There at Award
Net Outflow
Austria-Hungary
11
0
11
Prussia
11
0
11
Russian Empire
10
0
10
Canada
19
10
9
Russia
12
3
9
Scotland
9
0
9
the Netherlands
18
10
8
China
8
1
7
India
8
1
7
Japan
26
19
7
Hungary
8
2
6
Italy
13
7
6
Answer
Germany shows the largest net outflow by a wide margin. Poland, Hungary, and several other Eastern European countries also appear prominently. The pattern is not random. It reflects two major historical waves. The first was the emigration of Jewish scientists from Nazi Germany and German-occupied Europe in the 1930s and 1940s, many of whom ended up at universities in the United States or the United Kingdom and went on to win prizes there. The second was the broader post-World War II talent migration driven by political instability and limited academic freedom in Eastern Europe during the Cold War. The United States sits on the opposite end as the largest net importer, absorbing talent born in dozens of other countries.
Question 4: How does female representation compare across prize categories?
Question Description
Women are significantly underrepresented among Nobel laureates overall, but I wanted to know whether that gap is uniform across all six categories or whether some fields look meaningfully better or worse than others. I filtered to individual people, grouped by both category and gender, and calculated the female percentage within each category. I then displayed results both as a stacked percentage bar and as a table with raw counts so both the proportions and the absolute numbers are visible.
Code
Code
# Gender breakdown by categorygender_summary <- laureates_df |>filter(gender %in%c("female", "male"), !is.na(prize_category)) |>distinct(laureate_id, prize_category, gender) |>count(prize_category, gender) |>group_by(prize_category) |>mutate(total =sum(n),pct = n / total ) |>ungroup()# Stacked percentage bar chartgender_summary |>ggplot(aes(x =fct_reorder(prize_category, pct * (gender =="female")),y = pct,fill = gender )) +geom_col(position ="fill", colour ="white", width =0.7) +geom_hline(yintercept =0.5, linetype ="dashed", colour ="grey30", linewidth =0.8) +scale_y_continuous(labels =percent_format()) +scale_fill_manual(values =c("female"="#E75480", "male"="#4472C4"),labels =c("Female", "Male") ) +labs(title ="Female vs Male Nobel Laureates by Category",subtitle ="Dashed line marks 50 percent. Individual laureates only.",x =NULL,y ="Share of laureates",fill =NULL ) +theme_minimal(base_size =13) +theme(legend.position ="top")
Code
# Table with raw numbers and female percentagegender_summary |>filter(gender =="female") |>mutate(male = total - n,pct_female =percent(pct, accuracy =0.1) ) |>select(prize_category, female = n, male, total, pct_female) |>arrange(desc(female / total)) |>kable(col.names =c("Category", "Female", "Male", "Total", "Percent Female"),caption ="Female representation among Nobel laureates by prize category" ) |>kable_styling(bootstrap_options =c("striped", "hover"),full_width =FALSE)
Female representation among Nobel laureates by prize category
Category
Female
Male
Total
Percent Female
Peace
20
92
112
17.9%
Literature
18
104
122
14.8%
Physiology or Medicine
14
218
232
6.0%
Chemistry
8
190
198
4.0%
Economic Sciences
3
96
99
3.0%
Physics
5
224
229
2.2%
Answer
Literature and Peace have the highest female representation of the six categories, though both remain well below parity. Chemistry and Medicine or Physiology sit in the middle range. Economics and Physics have the lowest female shares, with Physics in particular showing under three percent female laureates across its entire history. The pattern reflects several overlapping realities: historical exclusion of women from academic science careers, institutional biases in hiring and funding that persisted well into the late twentieth century, and the long lag between doing prize-worthy research and receiving the award, which means today’s laureate counts still reflect workforce conditions from decades ago. The Peace and Literature prizes have less of this structural lag and also historically drew from broader, more accessible pools of candidates, which likely explains their relatively higher female representation.