For this assignment, the Nobel Prize public API will be utilized to retrieve structured JSON data pertaining to Nobel laureates and prize awards. The API provides endpoints that return detailed information on laureates, including their names, gender, birth countries, affiliations, prize categories, and award years.
The first step in the process will involve making API calls directly within R using packages such as httr (or httr2) and jsonlite. The JSON responses will then be parsed into R objects, for example with resp_body_json() from httr2 or fromJSON() from jsonlite, and converted into data frames.
Since the data is nested, additional steps will be required to properly unnest and tidy the data into a format suitable for analysis. This will involve the use of tidyverse tools such as dplyr and tidyr, including functions like unnest() and pivot_longer() where necessary.
Once the data is cleaned and structured, it will then be explored to identify meaningful patterns and relationships. Based on this exploration, the following four questions will be investigated:
How many individuals have received more than one Nobel Prize?
Which countries have produced the most Nobel laureates, based on country of birth?
Which Nobel Prize categories have shown the greatest net change in the number of laureates between their earliest and most recent recorded years?
How has the gender distribution of Nobel laureates changed over time?
These questions were selected to provide a mix of basic aggregation, categorical comparison, and time-based analysis. In particular, the third and fourth questions require examining trends across multiple variables (such as year, category, and gender), going beyond simple counts.
For each question, the objective will be clearly stated, the R code utilized to manipulate and analyze the data will be provided, and the results will be presented using appropriate outputs such as tables or visualizations created with ggplot2. The findings will then be interpreted in order to highlight any notable trends or insights.
This approach ensures that the workflow remains reproducible, transparent, and aligned with tidy data principles, while also demonstrating the ability to work with nested JSON data and perform meaningful data analysis.
As is the case with almost all of our analyses conducted within RStudio, the first step entails loading the relevant libraries. Specifically, the following libraries shall assist with the tasks of making API requests, parsing the retrieved JSON data structures, and performing visualization.
library(httr2)
library(jsonlite)
library(tidyverse)
library(lubridate)
The next step involves retrieving data from the Nobel Prize API. In this case, the laureates endpoint will be queried, as it provides a comprehensive set of information pertaining to individuals and organizations awarded Nobel Prizes, including their demographic details and prize history.
laureates_url <- "https://api.nobelprize.org/2.1/laureates?limit=1100"
laureates_json <- request(laureates_url) %>%
  req_perform() %>%
  resp_body_json()
At this stage, the API response has been successfully retrieved, and its JSON body has been parsed into a nested R list. The subsequent step will involve transforming this nested structure into a tidy format suitable for analysis.
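The request above can also be assembled piece by piece. The sketch below is one possible variant, not the method used in this analysis: it builds the query string with req_url_query() and adds a retry policy with req_retry(). The helper names build_laureates_request() and fetch_laureates_page() are hypothetical, and the offset parameter is assumed to be supported by the laureates endpoint alongside limit, which should be checked against the API documentation.

```r
library(httr2)

# Hypothetical helper: builds (but does not send) a request for one page
# of laureates. `offset` is assumed to be a supported query parameter.
build_laureates_request <- function(limit = 25, offset = 0) {
  req <- request("https://api.nobelprize.org/2.1/laureates")
  req <- req_url_query(req, limit = limit, offset = offset)
  # Retry transient failures up to three times in total
  req <- req_retry(req, max_tries = 3)
  req
}

# Hypothetical helper: sends the request and returns the list of laureates.
# req_perform() raises an R error for HTTP 4xx/5xx responses by default.
fetch_laureates_page <- function(limit = 25, offset = 0) {
  resp <- req_perform(build_laureates_request(limit, offset))
  resp_body_json(resp)$laureates
}

# Usage (requires network access):
# first_page <- fetch_laureates_page(limit = 25, offset = 0)
```

Separating request construction from execution makes it possible to inspect the URL before any network call is made.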
Because the Nobel Prize API returns nested JSON data, the response must be rectangled before it may be analyzed effectively. In this case, hoist() will be used to pull key components out of the nested list-column, while unnest_longer() and unnest_wider() will be used to expand prize-level information into a tidy format.
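To make the role of each rectangling function concrete, the following sketch applies the same sequence to a small hand-built list. The toy list only imitates the shape of the API response (placeholder names, two laureates, three prizes) and is an illustrative assumption, not data retrieved from the API.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for laureates_json$laureates (placeholder values only)
toy <- list(
  list(
    id = "1",
    fullName = list(en = "Person A"),
    nobelPrizes = list(
      list(awardYear = "1903", category = list(en = "Physics", se = "Fysik")),
      list(awardYear = "1911", category = list(en = "Chemistry", se = "Kemi"))
    )
  ),
  list(
    id = "2",
    orgName = list(en = "Organization B"),
    nobelPrizes = list(
      list(awardYear = "1917", category = list(en = "Peace", se = "Fred"))
    )
  )
)

toy_tidy <- tibble(laureate = toy) %>%
  # hoist(): pull selected fields out of the list-column; missing fields become NA
  hoist(
    laureate,
    laureate_id = "id",
    full_name = c("fullName", "en"),
    org_name = c("orgName", "en"),
    nobel_prizes = "nobelPrizes"
  ) %>%
  # unnest_longer(): one row per prize
  unnest_longer(nobel_prizes) %>%
  # unnest_wider(): one column per prize field
  unnest_wider(nobel_prizes) %>%
  # hoist() again: extract the English category label
  hoist(category, category_en = "en") %>%
  mutate(name = coalesce(full_name, org_name)) %>%
  select(laureate_id, name, awardYear, category_en)

print(toy_tidy)
```

The result has one row per laureate-prize combination, which is the same shape the full pipeline produces.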
# Convert JSON response into a tibble with one row per laureate
laureates_raw <- tibble(laureate = laureates_json$laureates)
# Pull key laureate-level fields from the nested list-column
laureates_tidy <- laureates_raw %>%
  hoist(
    laureate,
    laureate_id = "id",
    gender = "gender",
    full_name = c("fullName", "en"),
    org_name = c("orgName", "en"),
    birth_date = c("birth", "date"),
    birth_country = c("birth", "place", "country", "en"),
    founded_date = c("founded", "date"),
    founded_country = c("founded", "place", "country", "en"),
    nobel_prizes = "nobelPrizes"
  ) %>%
  mutate(
    # Use individual name where available, otherwise organization name
    name = coalesce(full_name, org_name),
    # Use birth details for persons, or founding details for organizations
    birth_date = coalesce(birth_date, founded_date),
    birth_country = coalesce(birth_country, founded_country)
  ) %>%
  select(
    laureate_id,
    name,
    gender,
    birth_date,
    birth_country,
    nobel_prizes
  )
# Expand the Nobel Prize list so that each prize gets its own row
laureates_tidy <- laureates_tidy %>%
  unnest_longer(nobel_prizes, keep_empty = TRUE)
# Widen each prize entry into columns
laureates_tidy <- laureates_tidy %>%
  unnest_wider(nobel_prizes)
# Pull nested prize-level fields into top-level columns
laureates_df <- laureates_tidy %>%
  hoist(
    category,
    category_en = "en"
  ) %>%
  mutate(
    award_year = as.integer(awardYear),
    prize_amount = as.numeric(prizeAmount),
    prize_amount_adjusted = as.numeric(prizeAmountAdjusted),
    birth_year = suppressWarnings(lubridate::year(lubridate::ymd(birth_date)))
  ) %>%
  select(
    laureate_id,
    name,
    gender,
    birth_date,
    birth_year,
    birth_country,
    award_year,
    category = category_en,
    prize_amount,
    prize_amount_adjusted
  )
glimpse(laureates_df)
Rows: 1,026
Columns: 10
$ laureate_id <chr> "745", "102", "779", "259", "1004", "114", "982"…
$ name <chr> "A. Michael Spence", "Aage Niels Bohr", "Aaron C…
$ gender <chr> "male", "male", "male", "male", "male", "male", …
$ birth_date <chr> "1943-00-00", "1922-06-19", "1947-10-01", "1926-…
$ birth_year <dbl> NA, 1922, 1947, 1926, NA, 1926, 1961, 1976, 1939…
$ birth_country <chr> "USA", "Denmark", "British Protectorate of Pales…
$ award_year <int> 2001, 1975, 2004, 1982, 2021, 1979, 2019, 2019, …
$ category <chr> "Economic Sciences", "Physics", "Chemistry", "Ch…
$ prize_amount <dbl> 10000000, 630000, 10000000, 1150000, 10000000, 8…
$ prize_amount_adjusted <dbl> 15547541, 4304697, 14874529, 3923237, 12096939, …
head(laureates_df)
# A tibble: 6 × 10
laureate_id name gender birth_date birth_year birth_country award_year
<chr> <chr> <chr> <chr> <dbl> <chr> <int>
1 745 A. Michael … male 1943-00-00 NA USA 2001
2 102 Aage Niels … male 1922-06-19 1922 Denmark 1975
3 779 Aaron Ciech… male 1947-10-01 1947 British Prot… 2004
4 259 Aaron Klug male 1926-08-11 1926 Lithuania 1982
5 1004 Abdulrazak … male 1948-00-00 NA <NA> 2021
6 114 Abdus Salam male 1926-01-29 1926 India 1979
# ℹ 3 more variables: category <chr>, prize_amount <dbl>,
# prize_amount_adjusted <dbl>
The resulting laureates_df now contains one row per laureate-prize combination, which is appropriate for answering questions pertaining to repeat winners, country representation, category growth, and gender trends.
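One detail visible in the output above is that birth_year is NA for partial dates such as "1943-00-00", since a date with a zero month and day cannot be parsed as a calendar date. If the year alone is needed, one possible workaround (a sketch, not part of the original pipeline) is to read the year from the first four characters of the string. The vector below reuses date strings shown in the output above plus one missing value.

```r
# Date strings as they appear in the birth_date column, plus a missing value
birth_date <- c("1943-00-00", "1922-06-19", NA, "1948-00-00")

# Take the first four characters as the year; NA inputs stay NA
birth_year <- as.integer(substr(birth_date, 1, 4))

print(birth_year)
```

This keeps the year for partially known dates while leaving genuinely missing dates as NA.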
The first question seeks to examine how many laureates have been awarded the Nobel Prize on more than one occasion. Since the current data frame contains one row per laureate-prize combination, this may be determined by counting the number of prize records associated with each laureate and then filtering for those with more than one recorded award.
# Count the number of Nobel Prizes received by each laureate
multiple_winners <- laureates_df %>%
  distinct(laureate_id, name, award_year, category) %>%
  count(laureate_id, name, sort = TRUE, name = "num_prizes") %>%
  filter(num_prizes > 1)
multiple_winners
# A tibble: 7 × 3
laureate_id name num_prizes
<chr> <chr> <int>
1 482 International Committee of the Red Cross 3
2 217 Linus Carl Pauling 2
3 222 Frederick Sanger 2
4 515 Office of the United Nations High Commissioner for Ref… 2
5 6 Marie Curie, née Skłodowska 2
6 66 John Bardeen 2
7 743 K. Barry Sharpless 2
The output above identifies the specific laureates (both individuals and entities) who have received more than one Nobel Prize, along with the number of prizes associated with each.
To answer the question directly: 5 individuals and 2 organizations have been awarded the Nobel Prize more than once.
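The split between individuals and organizations can also be computed rather than read off the table. The sketch below assumes that organizations carry no gender value (so gender is NA after hoisting), which is an assumption about the data rather than something verified here. The small table reuses the laureate IDs from the output above, with gender filled in by hand for illustration.

```r
library(dplyr)

# Repeat winners from the table above; gender entered by hand,
# with NA standing in for organizations (assumption)
repeat_winners <- tibble(
  laureate_id = c("482", "217", "222", "515", "6", "66", "743"),
  gender = c(NA, "male", "male", NA, "female", "male", "male"),
  num_prizes = c(3L, 2L, 2L, 2L, 2L, 2L, 2L)
)

type_counts <- repeat_winners %>%
  mutate(
    laureate_type = if_else(is.na(gender), "organization", "individual")
  ) %>%
  count(laureate_type)

print(type_counts)
```

On the full data, the same mutate() and count() steps could be appended to multiple_winners after joining gender back in.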
This second question shifts the focus from individuals to national representation. More specifically, it considers which birth countries appear most frequently in the dataset. Since each laureate should only be counted once for this question, distinct laureate-country combinations will first be retained before the totals are computed.
top_birth_countries <- laureates_df %>%
  filter(!is.na(birth_country)) %>%
  distinct(laureate_id, name, birth_country) %>%
  count(birth_country, sort = TRUE)
head(top_birth_countries, 10)
# A tibble: 10 × 2
birth_country n
<chr> <int>
1 USA 303
2 United Kingdom 96
3 Germany 80
4 France 62
5 Japan 31
6 Sweden 30
7 Switzerland 24
8 Canada 23
9 the Netherlands 20
10 Russia 19
The table above represents the top 10 countries ranked in accordance with the number of Nobel laureates produced. In interpreting this, we can see that the U.S.A. produced the greatest number of laureates (with a figure of 303), followed by the United Kingdom (with 96 laureates), and Germany (which produced 80 laureates).
A visualization may also be constructed to assist in comparing the highest-laureate-producing countries.
top_birth_countries %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(birth_country, n), y = n, fill = birth_country)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 Birth Countries of Nobel Laureates",
    x = "Birth Countries",
    y = "Number of Laureates"
  )
As the visualization above illustrates, a small number of countries account for a large share of laureate origins: the USA alone (303) has more laureates than the next five countries combined (299).
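The concentration among the leading countries can be expressed as shares. The sketch below re-enters the ten rows from the table above, so the shares are relative to those ten countries only; applying the same mutate() to the full top_birth_countries table would give shares of all laureates with a recorded birth country. The column names share_of_top10 and cumulative_share are illustrative.

```r
library(dplyr)

# The ten rows shown in the table above
top10 <- tibble(
  birth_country = c(
    "USA", "United Kingdom", "Germany", "France", "Japan",
    "Sweden", "Switzerland", "Canada", "the Netherlands", "Russia"
  ),
  n = c(303, 96, 80, 62, 31, 30, 24, 23, 20, 19)
)

top10_share <- top10 %>%
  mutate(
    share_of_top10 = n / sum(n),
    cumulative_share = cumsum(share_of_top10)
  )

print(top10_share)
```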
This third question introduces a more time-based analysis by examining which Nobel Prize categories displayed the greatest net change in the number of laureates between their earliest and most recent recorded years. In order to approach this, the data will first be grouped by both award year and category.
category_trends <- laureates_df %>%
  filter(!is.na(award_year), !is.na(category)) %>%
  distinct(laureate_id, name, award_year, category) %>%
  count(award_year, category)
head(category_trends)
# A tibble: 6 × 3
award_year category n
<int> <chr> <int>
1 1901 Chemistry 1
2 1901 Literature 1
3 1901 Peace 2
4 1901 Physics 1
5 1901 Physiology or Medicine 1
6 1902 Chemistry 1
A summarized measure of change across time, as follows, may provide a clearer basis for interpretation and comparison.
category_growth <- category_trends %>%
  group_by(category) %>%
  summarise(
    first_year = min(award_year, na.rm = TRUE),
    last_year = max(award_year, na.rm = TRUE),
    first_count = n[award_year == first_year][1],
    last_count = n[award_year == last_year][1],
    growth = last_count - first_count,
    .groups = "drop"
  ) %>%
  arrange(desc(growth))
category_growth
# A tibble: 6 × 6
category first_year last_year first_count last_count growth
<chr> <int> <int> <int> <int> <int>
1 Chemistry 1901 2025 1 3 2
2 Physics 1901 2025 1 3 2
3 Physiology or Medicine 1901 2025 1 3 2
4 Economic Sciences 1969 2025 2 3 1
5 Literature 1901 2025 1 1 0
6 Peace 1901 2025 2 1 -1
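As the summary above indicates, Chemistry, Physics, and Physiology or Medicine exhibited the greatest net change, each increasing from 1 laureate in their earliest recorded year to 3 in 2025. In contrast, Literature remained unchanged between its earliest and most recent recorded years, while Peace declined by one, from 2 laureates in 1901 to 1 in 2025. Because this measure compares only two endpoint years, it is sensitive to how many laureates happened to share the prize in those particular years.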
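Since the comparison above rests on two single years, a less endpoint-sensitive variant would average the counts over the first and last few award years of each category. The sketch below illustrates the idea on a made-up toy table (categories "A" and "B" and all counts are invented for illustration); k_years is a hypothetical tuning parameter.

```r
library(dplyr)

# Invented toy data in the same shape as category_trends
toy_trends <- tibble(
  award_year = rep(c(1901L, 1902L, 2024L, 2025L), times = 2),
  category = rep(c("A", "B"), each = 4),
  n = c(1, 1, 3, 3, 2, 1, 1, 1)
)

# Number of years to average at each end (hypothetical choice)
k_years <- 2

toy_growth <- toy_trends %>%
  group_by(category) %>%
  arrange(award_year, .by_group = TRUE) %>%
  summarise(
    early_mean = mean(head(n, k_years)),
    late_mean = mean(tail(n, k_years)),
    growth = late_mean - early_mean,
    .groups = "drop"
  ) %>%
  arrange(desc(growth))

print(toy_growth)
```

On the real data, a larger window such as ten award years would likely be more informative than two.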
The fourth, and final, question examines how gender representation among Nobel laureates has changed across the years. This requires comparison across both year and gender.
gender_trends <- laureates_df %>%
  filter(!is.na(award_year), !is.na(gender)) %>%
  distinct(laureate_id, name, award_year, gender) %>%
  count(award_year, gender)
head(gender_trends)
# A tibble: 6 × 3
award_year gender n
<int> <chr> <int>
1 1901 male 6
2 1902 male 7
3 1903 female 1
4 1903 male 6
5 1904 male 5
6 1905 female 1
The table above shows the yearly number of laureates by gender. A line graph may now be used to illustrate how these counts change over time (award_year).
ggplot(data = gender_trends, aes(x = award_year, y = n, color = gender)) +
  geom_line() +
  labs(
    title = "Gender Distribution of Nobel Laureates Over Time",
    x = "Award Year",
    y = "Number of Laureates",
    color = "Gender"
  )
The above visualization indicates that Nobel Prizes have historically been awarded predominantly to male recipients, with the yearly number of male laureates increasing steadily over time. However, female laureates begin to appear more frequently in later years, reflecting a gradual, albeit limited, improvement in gender representation.
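Raw yearly counts can obscure changes in proportion, so one complementary view is the share of laureates by gender within each decade. The sketch below uses only the six rows of gender_trends shown earlier, so the resulting figures are illustrative and incomplete; the column names decade, laureates, and share are assumptions introduced here.

```r
library(dplyr)

# The six rows of gender_trends shown above
gender_head <- tibble(
  award_year = c(1901L, 1902L, 1903L, 1903L, 1904L, 1905L),
  gender = c("male", "male", "female", "male", "male", "female"),
  n = c(6L, 7L, 1L, 6L, 5L, 1L)
)

gender_by_decade <- gender_head %>%
  # Map each year to the start of its decade, e.g. 1903 -> 1900
  mutate(decade = (award_year %/% 10) * 10) %>%
  group_by(decade, gender) %>%
  summarise(laureates = sum(n), .groups = "drop_last") %>%
  # Still grouped by decade here, so shares sum to 1 within each decade
  mutate(share = laureates / sum(laureates)) %>%
  ungroup()

print(gender_by_decade)
```

Applied to the full gender_trends table, the share column could be plotted against decade to show proportional change directly.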
Overall, this exercise demonstrated how nested JSON data structures from the Nobel Prize API may be transformed into a tidy data format in R and used to answer a range of data-driven questions. The analysis showed that only a small number of laureates have received more than one Nobel Prize, that a relatively small group of countries account for a substantial share of laureates by birth, that Chemistry, Physics, and Physiology or Medicine showed the largest net increase in laureates between their earliest and most recent award years, and that male recipients have historically dominated the awards, even though female representation has increased somewhat in more recent times.