week_10B_nobel_prize

Author

Brandon Chanderban

Published

April 15, 2026

Introduction/Approach

For this assignment, the Nobel Prize public API will be utilized to retrieve structured JSON data pertaining to Nobel laureates and prize awards. The API provides endpoints that return detailed information on laureates, including their names, gender, birth countries, affiliations, prize categories, and award years.

Data Retrieval and Preparation

The first step is to call the API directly from R using httr2 (the successor to httr). The JSON responses will then be parsed into R objects, either with resp_body_json() from httr2 or with fromJSON() from jsonlite, before being converted into data frames.

Since the data is nested, additional steps will be required to properly unnest and tidy the data into a format suitable for analysis. This will involve the use of tidyverse tools such as dplyr and tidyr, including functions like unnest() and pivot_longer() where necessary.

Research Questions

Once the data is cleaned and structured, it will then be explored to identify meaningful patterns and relationships. Based on this exploration, the following four questions will be investigated:

  1. How many individuals have received more than one Nobel Prize?

  2. Which countries have produced the most Nobel laureates, based on country of birth?

  3. Which Nobel Prize categories have shown the greatest net change in the number of laureates between their earliest and most recent recorded years?

  4. How has the gender distribution of Nobel laureates changed over time?

These questions were selected to provide a mix of basic aggregation, categorical comparison, and time-based analysis. In particular, the third and fourth questions require examining trends across multiple variables (such as year, category, and gender), going beyond simple counts.

Analysis and Presentation

For each question, the objective will be clearly stated, the R code utilized to manipulate and analyze the data will be provided, and the results will be presented using appropriate outputs such as tables or visualizations created with ggplot2. The findings will then be interpreted in order to highlight any notable trends or insights.

This approach ensures that the workflow remains reproducible, transparent, and aligned with tidy data principles, while also demonstrating the ability to work with nested JSON data and perform meaningful data analysis.

Code Base/Body

As with most analyses conducted within RStudio, the first step is to load the relevant libraries. The following packages handle API requests (httr2), JSON parsing (jsonlite), data tidying and visualization (tidyverse), and date handling (lubridate).

Code
library(httr2)
library(jsonlite)
library(tidyverse)
library(lubridate)

Retrieving the Nobel Prize Data

The next step involves retrieving data from the Nobel Prize API. In this case, the laureates endpoint will be queried, as it provides a comprehensive set of information pertaining to individuals and organizations awarded Nobel Prizes, including their demographic details and prize history.

Code
laureates_url <- "https://api.nobelprize.org/2.1/laureates?limit=1100"

laureates_json <- request(laureates_url) %>%
  req_perform() %>%
  resp_body_json()

At this stage, the API response has been retrieved and parsed by resp_body_json() into a nested R list. The subsequent step will involve transforming this nested structure into a tidy format suitable for analysis.
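
Because the request relies on a fixed limit of 1100, it may be worth confirming that the response was not cut off at that limit. The following is a minimal sketch of such a check; the helper name check_not_truncated() and the toy response are hypothetical and are not part of the original workflow.

```r
# Hypothetical helper: returns TRUE when fewer records than `limit` came back,
# and warns when the record count reaches the limit (possible truncation).
check_not_truncated <- function(parsed, limit) {
  n_returned <- length(parsed$laureates)
  if (n_returned >= limit) {
    warning(
      "Returned ", n_returned, " records, which reaches the limit of ",
      limit, "; the results may be truncated."
    )
  }
  n_returned < limit
}

# Toy stand-in for a parsed response (three made-up records)
toy_response <- list(
  laureates = list(list(id = "1"), list(id = "2"), list(id = "3"))
)

check_not_truncated(toy_response, limit = 1100)
```

In the workflow above, the equivalent call would be check_not_truncated(laureates_json, limit = 1100).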

Transforming and Tidying the Data

Because the Nobel Prize API returns nested JSON data, the response must be rectangled before it may be analyzed effectively. In this case, hoist() will be used to pull key components out of the nested list-column, while unnest_longer() and unnest_wider() will be used to expand prize-level information into a tidy format.
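
Before applying these functions to the full response, their behaviour may be illustrated on a small example. The sketch below uses a made-up two-record list that only loosely imitates the shape of the API response; the record values and the toy object names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Toy records: one person and one organization (all values are illustrative)
toy <- tibble(laureate = list(
  list(
    id = "1",
    fullName = list(en = "Person A"),
    nobelPrizes = list(
      list(awardYear = "1903", prizeStatus = "received"),
      list(awardYear = "1911", prizeStatus = "received")
    )
  ),
  list(
    id = "2",
    orgName = list(en = "Organization B"),
    nobelPrizes = list(
      list(awardYear = "1917", prizeStatus = "received")
    )
  )
))

toy_tidy <- toy %>%
  hoist(
    laureate,
    laureate_id = "id",
    full_name = c("fullName", "en"),  # absent for the organization, so NA
    org_name = c("orgName", "en"),    # absent for the person, so NA
    nobel_prizes = "nobelPrizes"
  ) %>%
  mutate(name = coalesce(full_name, org_name)) %>%
  select(laureate_id, name, nobel_prizes) %>%
  unnest_longer(nobel_prizes) %>%     # one row per prize
  unnest_wider(nobel_prizes)          # one column per prize field

toy_tidy
```

The person with two prizes occupies two rows after unnest_longer(), which is the same one-row-per-laureate-prize shape produced for the real data.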

Code
# Convert JSON response into a tibble with one row per laureate
laureates_raw <- tibble(laureate = laureates_json$laureates)

# Pull key laureate-level fields from the nested list-column
laureates_tidy <- laureates_raw %>%
  hoist(
    laureate,
    laureate_id = "id",
    gender = "gender",
    full_name = c("fullName", "en"),
    org_name = c("orgName", "en"),
    birth_date = c("birth", "date"),
    birth_country = c("birth", "place", "country", "en"),
    founded_date = c("founded", "date"),
    founded_country = c("founded", "place", "country", "en"),
    nobel_prizes = "nobelPrizes"
  ) %>%
  mutate(
    # Use individual name where available, otherwise organization name
    name = coalesce(full_name, org_name),

    # Use birth details for persons, or founding details for organizations
    birth_date = coalesce(birth_date, founded_date),
    birth_country = coalesce(birth_country, founded_country)
  ) %>%
  select(
    laureate_id,
    name,
    gender,
    birth_date,
    birth_country,
    nobel_prizes
  )

# Expand the Nobel Prize list so that each prize gets its own row
laureates_tidy <- laureates_tidy %>%
  unnest_longer(nobel_prizes, keep_empty = TRUE)

# Widen each prize entry into columns
laureates_tidy <- laureates_tidy %>%
  unnest_wider(nobel_prizes)

# Pull nested prize-level fields into top-level columns
laureates_df <- laureates_tidy %>%
  hoist(
    category,
    category_en = "en"
  ) %>%
  mutate(
    award_year = as.integer(awardYear),
    prize_amount = as.numeric(prizeAmount),
    prize_amount_adjusted = as.numeric(prizeAmountAdjusted),
    birth_year = suppressWarnings(lubridate::year(lubridate::ymd(birth_date)))
  ) %>%
  select(
    laureate_id,
    name,
    gender,
    birth_date,
    birth_year,
    birth_country,
    award_year,
    category = category_en,
    prize_amount,
    prize_amount_adjusted
  )

glimpse(laureates_df)
Rows: 1,026
Columns: 10
$ laureate_id           <chr> "745", "102", "779", "259", "1004", "114", "982"…
$ name                  <chr> "A. Michael Spence", "Aage Niels Bohr", "Aaron C…
$ gender                <chr> "male", "male", "male", "male", "male", "male", …
$ birth_date            <chr> "1943-00-00", "1922-06-19", "1947-10-01", "1926-…
$ birth_year            <dbl> NA, 1922, 1947, 1926, NA, 1926, 1961, 1976, 1939…
$ birth_country         <chr> "USA", "Denmark", "British Protectorate of Pales…
$ award_year            <int> 2001, 1975, 2004, 1982, 2021, 1979, 2019, 2019, …
$ category              <chr> "Economic Sciences", "Physics", "Chemistry", "Ch…
$ prize_amount          <dbl> 10000000, 630000, 10000000, 1150000, 10000000, 8…
$ prize_amount_adjusted <dbl> 15547541, 4304697, 14874529, 3923237, 12096939, …
Code
head(laureates_df)
# A tibble: 6 × 10
  laureate_id name         gender birth_date birth_year birth_country award_year
  <chr>       <chr>        <chr>  <chr>           <dbl> <chr>              <int>
1 745         A. Michael … male   1943-00-00         NA USA                 2001
2 102         Aage Niels … male   1922-06-19       1922 Denmark             1975
3 779         Aaron Ciech… male   1947-10-01       1947 British Prot…       2004
4 259         Aaron Klug   male   1926-08-11       1926 Lithuania           1982
5 1004        Abdulrazak … male   1948-00-00         NA <NA>                2021
6 114         Abdus Salam  male   1926-01-29       1926 India               1979
# ℹ 3 more variables: category <chr>, prize_amount <dbl>,
#   prize_amount_adjusted <dbl>

The resulting laureates_df now contains one row per laureate-prize combination, which is appropriate for answering questions pertaining to repeat winners, country representation, category growth, and gender trends.
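
One caveat is visible in the output above: dates recorded with an unknown month and day (such as "1943-00-00") cannot be parsed by ymd(), so birth_year is NA for those laureates. Should the birth year be needed later, one possible alternative is to take the year directly from the first four characters of the date string. The helper name extract_year() below is hypothetical, and the sketch assumes the dates always begin with a four-digit year.

```r
# Hypothetical helper: read the year from the first four characters of a
# "YYYY-MM-DD" string, so that partial dates such as "1943-00-00" still work.
extract_year <- function(date_chr) {
  as.integer(substr(date_chr, 1, 4))
}

# Two values taken from the glimpse() output above, plus a missing value
birth_dates <- c("1943-00-00", "1922-06-19", NA)

extract_year(birth_dates)
```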

Addressing the Four Proposed Questions

Question 1: How many individuals have received more than one Nobel Prize?

The first question seeks to examine how many laureates have been awarded the Nobel Prize on more than one occasion. Since the current data frame contains one row per laureate-prize combination, this may be determined by counting the number of prize records associated with each laureate and then filtering for those with more than one recorded award.

Code
# Count the number of Nobel Prizes received by each laureate

multiple_winners <- laureates_df %>%
  distinct(laureate_id, name, award_year, category) %>%
  count(laureate_id, name, sort = TRUE, name = "num_prizes") %>%
  filter(num_prizes > 1)

multiple_winners
# A tibble: 7 × 3
  laureate_id name                                                    num_prizes
  <chr>       <chr>                                                        <int>
1 482         International Committee of the Red Cross                         3
2 217         Linus Carl Pauling                                               2
3 222         Frederick Sanger                                                 2
4 515         Office of the United Nations High Commissioner for Ref…          2
5 6           Marie Curie, née Skłodowska                                      2
6 66          John Bardeen                                                     2
7 743         K. Barry Sharpless                                               2

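The output above identifies the specific laureates (both individuals and organizations) who have received more than one Nobel Prize, along with the number of prizes associated with each.

To answer the question as posed: five individuals (Linus Carl Pauling, Frederick Sanger, Marie Curie, John Bardeen, and K. Barry Sharpless) have been awarded the Nobel Prize more than once, as have two organizations (the International Committee of the Red Cross and the Office of the United Nations High Commissioner for Refugees).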
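
The split between individuals and organizations may also be derived in code rather than read from the table. The sketch below assumes that organizations carry no gender value (that is, gender is NA for them), which should be verified against the real data; the toy rows and the laureate_type column are made up for illustration.

```r
library(dplyr)

# Toy laureate-prize rows (illustrative values only)
toy_prizes <- tibble(
  laureate_id = c("a", "a", "b", "b", "c"),
  name = c("Person A", "Person A", "Org B", "Org B", "Person C"),
  gender = c("female", "female", NA, NA, "male"),
  award_year = c(1903L, 1911L, 1917L, 1944L, 1956L)
)

repeat_winners <- toy_prizes %>%
  count(laureate_id, name, gender, name = "num_prizes") %>%
  filter(num_prizes > 1) %>%
  # Assumption: a missing gender marks an organization
  mutate(laureate_type = if_else(is.na(gender), "organization", "individual"))

repeat_winners %>%
  count(laureate_type)
```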

Question 2: Which countries have produced the most Nobel laureates, based on country of birth?

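This second question shifts the focus from individuals to national representation. More specifically, it considers which birth countries appear most frequently in the dataset. Since each laureate should be counted only once for this question, distinct laureate-country combinations are retained before the totals are computed. Note that, because birth_country was earlier filled in with the founding country for organizations, any organization with a recorded founding country is also included in these counts.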

Code
top_birth_countries <- laureates_df %>%
  filter(!is.na(birth_country)) %>%
  distinct(laureate_id, name, birth_country) %>%
  count(birth_country, sort = TRUE)

head(top_birth_countries, 10)
# A tibble: 10 × 2
   birth_country       n
   <chr>           <int>
 1 USA               303
 2 United Kingdom     96
 3 Germany            80
 4 France             62
 5 Japan              31
 6 Sweden             30
 7 Switzerland        24
 8 Canada             23
 9 the Netherlands    20
10 Russia             19

The table above lists the top 10 birth countries ranked by number of Nobel laureates. The USA leads with 303 laureates, followed by the United Kingdom (96) and Germany (80).

A bar chart may also be constructed to make the comparison among these countries easier to see.

Code
top_birth_countries %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(birth_country, n), y = n, fill = birth_country)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 Birth Countries of Nobel Laureates",
    x = "Birth Countries",
    y = "Number of Laureates"
  )

As the visualization above illustrates, laureate origins are concentrated in a small number of countries: the USA alone (303) exceeds the combined total of the next four countries (96 + 80 + 62 + 31 = 269).
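
The degree of concentration may be quantified by adding each country's share of the total and a running cumulative share. The following is a minimal sketch on made-up counts; the helper name add_share() and the toy values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical helper: add each row's share of the total and a running total
add_share <- function(counts) {
  counts %>%
    arrange(desc(n)) %>%
    mutate(
      share = n / sum(n),
      cumulative_share = cumsum(share)
    )
}

# Toy counts (illustrative values, not the real data)
toy_counts <- tibble(
  birth_country = c("Country A", "Country B", "Country C"),
  n = c(60L, 30L, 10L)
)

toy_shares <- add_share(toy_counts)
toy_shares
```

Applied to the full top_birth_countries table (rather than only its first 10 rows), the cumulative_share column would show how few countries are needed to reach any given fraction of all laureates.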

Question 3: Which Nobel Prize categories have shown the greatest net change in the number of laureates between their earliest and most recent recorded years?

This third question introduces a more time-based analysis by examining which Nobel Prize categories displayed the greatest net change in the number of laureates between their earliest and most recent recorded years. In order to approach this, the data will first be grouped by both award year and category.

Code
category_trends <- laureates_df %>%
  filter(!is.na(award_year), !is.na(category)) %>%
  distinct(laureate_id, name, award_year, category) %>%
  count(award_year, category)

head(category_trends)
# A tibble: 6 × 3
  award_year category                   n
       <int> <chr>                  <int>
1       1901 Chemistry                  1
2       1901 Literature                 1
3       1901 Peace                      2
4       1901 Physics                    1
5       1901 Physiology or Medicine     1
6       1902 Chemistry                  1

A summary measure of change, computed as the difference between each category's laureate count in its earliest and most recent award years, provides a clearer basis for interpretation and comparison.

Code
category_growth <- category_trends %>%
  group_by(category) %>%
  summarise(
    first_year = min(award_year, na.rm = TRUE),
    last_year = max(award_year, na.rm = TRUE),
    first_count = n[award_year == first_year][1],
    last_count = n[award_year == last_year][1],
    growth = last_count - first_count,
    .groups = "drop"
  ) %>%
  arrange(desc(growth))

category_growth
# A tibble: 6 × 6
  category               first_year last_year first_count last_count growth
  <chr>                       <int>     <int>       <int>      <int>  <int>
1 Chemistry                    1901      2025           1          3      2
2 Physics                      1901      2025           1          3      2
3 Physiology or Medicine       1901      2025           1          3      2
4 Economic Sciences            1969      2025           2          3      1
5 Literature                   1901      2025           1          1      0
6 Peace                        1901      2025           2          1     -1

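As the summary above indicates, Chemistry, Physics, and Physiology or Medicine exhibited the greatest net change, each increasing from 1 laureate in 1901 to 3 in 2025. Economic Sciences, first recorded in 1969, rose from 2 to 3. In contrast, Literature remained unchanged between its earliest and most recent recorded years, while Peace declined from 2 to 1. Because this measure compares only two endpoint years, it is sensitive to the particular number of laureates in those years and should be read as a rough indicator rather than a full trend.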
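
A less endpoint-sensitive alternative would be to average the yearly laureate counts within each decade and compare the decades. The sketch below illustrates the idea on made-up yearly counts for a single category; it is not the measure used above, and the toy values are assumptions.

```r
library(dplyr)

# Toy yearly counts for one category (illustrative values only)
toy_trends <- tibble(
  award_year = c(1901L, 1905L, 1909L, 2021L, 2023L, 2025L),
  category = "Physics",
  n = c(1L, 1L, 2L, 3L, 2L, 3L)
)

decade_avg <- toy_trends %>%
  mutate(decade = (award_year %/% 10) * 10) %>%  # e.g. 1905 becomes 1900
  group_by(category, decade) %>%
  summarise(mean_laureates = mean(n), .groups = "drop")

decade_avg
```

With the real data, category_trends could be passed through the same steps in place of toy_trends.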

Question 4: How has the gender distribution of Nobel laureates changed over time?

The fourth, and final, question examines how gender representation among Nobel laureates has changed across the years. This requires comparison across both year and gender.

Code
gender_trends <- laureates_df %>%
  filter(!is.na(award_year), !is.na(gender)) %>%
  distinct(laureate_id, name, award_year, gender) %>%
  count(award_year, gender)

head(gender_trends)
# A tibble: 6 × 3
  award_year gender     n
       <int> <chr>  <int>
1       1901 male       6
2       1902 male       7
3       1903 female     1
4       1903 male       6
5       1904 male       5
6       1905 female     1

The table above shows the yearly number of laureates by gender. A line graph may now be used to illustrate how these counts change across award_year.

Code
ggplot(data = gender_trends, aes(x = award_year, y = n, color = gender)) +
  geom_line() +
  labs(
    title = "Gender Distribution of Nobel Laureates Over Time",
    x = "Award Year",
    y = "Number of Laureates",
    color = "Gender"
  )

The above visualization indicates that Nobel Prizes have historically been awarded predominantly to male recipients, with their numbers increasing steadily over time. However, female laureates begin to appear more frequently in later years, reflecting a gradual, albeit limited, improvement in gender representation.
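
One detail worth noting is that count() produces no row for a year-gender combination with zero laureates (for instance, there is no "female" row for 1901 or 1902 in the table above), so a line plot simply connects the years that are present. The sketch below, offered as one possible refinement, fills those gaps with explicit zeros using complete() and then computes the female share per year. The toy rows are the first four rows of the table above; the object names are assumptions.

```r
library(dplyr)
library(tidyr)

# First four rows of the gender_trends output shown earlier
toy_gender <- tibble(
  award_year = c(1901L, 1902L, 1903L, 1903L),
  gender = c("male", "male", "female", "male"),
  n = c(6L, 7L, 1L, 6L)
)

# Add an explicit zero-count row for every missing year-gender combination
gender_complete <- toy_gender %>%
  complete(award_year, gender, fill = list(n = 0L))

# Share of laureates in each year who are female
female_share <- gender_complete %>%
  group_by(award_year) %>%
  summarise(
    female_share = sum(n[gender == "female"]) / sum(n),
    .groups = "drop"
  )

female_share
```

Plotting a share rather than raw counts also separates the change in gender balance from the general increase in the number of laureates per year.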

Conclusion

Overall, this exercise demonstrated how nested JSON data structures from the Nobel Prize API may be transformed into a tidy data format in R and used to answer a range of data-driven questions. The analysis showed that only seven laureates (five individuals and two organizations) have received more than one Nobel Prize, that a relatively small group of countries accounts for a substantial share of laureates by birth, that Chemistry, Physics, and Physiology or Medicine each grew from 1 laureate in 1901 to 3 in 2025, and that male recipients have historically dominated the awards, even though female representation has increased somewhat in more recent times.

LLM Used