Introduction

This project imports data from the “laureates” endpoint of the Nobel Prize API (https://api.nobelprize.org/2.1/laureates) and converts the JSON response into an R data frame for analysis. Tidyverse functions are then used to clean, unnest, summarize, and test the data on Nobel Prize history.

Four questions are asked in this project:

  1. What is the gender distribution among Nobel laureates?
  2. Which 10 institutions have the most laureates?
  3. How has the Nobel Prize amount changed over time?
  4. Do male and female laureates receive different average prize amounts?
library(httr)
library(jsonlite)
library(tidyverse)
# import data from laureates api, limit = 1000
url <- "https://api.nobelprize.org/2.1/laureates?limit=1000"
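response <- GET(url)
stop_for_status(response)  # stop early if the request failed

# convert JSON to data frame (explicit encoding avoids httr's guessing message)
df <- fromJSON(content(response, "text", encoding = "UTF-8"), flatten = TRUE)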
# nobelPrizes and its affiliations are list-columns:
# unnest both (one row per laureate, prize, and affiliation), then select columns
laureates <- df$laureates %>%
  unnest(nobelPrizes, keep_empty = TRUE, names_sep = "_") %>%
  unnest(nobelPrizes_affiliations, keep_empty = TRUE, names_sep = "_") %>%
  select(
    fullName.en, 
    gender, 
    birth.date, 
    death.date, 
    birth.place.city.en, 
    birth.place.country.en, 
    nobelPrizes_awardYear, 
    nobelPrizes_prizeAmount, 
    nobelPrizes_affiliations_name.en,
    nobelPrizes_category.en
  )
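
The two `unnest()` calls above decide what one row of `laureates` means, so a small illustration may help. The sketch below uses made-up records (the names `Person A`, `Organization B`, `Institute X`, and `Institute Y` are hypothetical, not API data) that are assumed to mimic the nesting of the real response. It shows that a laureate with several prizes or affiliations ends up on several rows, and that `keep_empty = TRUE` keeps a prize with no affiliation as a row of `NA`.

```r
library(dplyr)
library(tidyr)

# Made-up records that mimic the assumed nesting of the API response:
# each laureate holds a list of prizes, and each prize a list of affiliations.
toy <- tibble(
  fullName.en = c("Person A", "Organization B"),
  nobelPrizes = list(
    tibble(
      awardYear = c("1903", "1911"),
      affiliations = list(
        tibble(name.en = "Institute X"),
        tibble(name.en = c("Institute X", "Institute Y"))
      )
    ),
    tibble(
      awardYear = "1950",
      affiliations = list(NULL)  # no affiliation recorded
    )
  )
)

# Same two-step unnesting as in the report
flat <- toy %>%
  unnest(nobelPrizes, keep_empty = TRUE, names_sep = "_") %>%
  unnest(nobelPrizes_affiliations, keep_empty = TRUE, names_sep = "_")

flat
```

In this toy case, `Person A` occupies three rows (one prize with one affiliation, one prize with two), while `Organization B` keeps a single row with a missing affiliation. Any later `count()` on the unnested data therefore counts rows, not unique laureates.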

Question 1: What is the gender distribution of laureates?

laureates %>%
  count(gender) %>%
  ggplot(aes(x = gender, y = n, fill = gender)) +
  geom_col(width = 0.5) +
  geom_text(
    aes(label = n),
    vjust = 0.1  # constant placement belongs outside aes()
  ) +
  labs(
    title = "Gender Distribution of Nobel Laureates",
    x = "Gender",
    y = "Count"
  )

It is striking that between 1901 and 2025 the female-to-male ratio is 70:996. These are row counts after unnesting, so laureates with more than one prize or affiliation are counted more than once. The NA group consists of organizations. The result raises a further question: what causes such a large gap?
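
Because the counts include duplicate laureates, it can be useful to compare row counts with counts of unique laureates. The following is a minimal sketch on made-up rows, not the real data. It assumes that `fullName.en` identifies a laureate uniquely, which may not hold in general; the API's own identifier would be a safer key if it were kept in the selection.

```r
library(dplyr)

# Made-up rows for illustration: "Person A" appears three times after unnesting
toy <- tibble(
  fullName.en = c("Person A", "Person A", "Person A", "Person B", "Organization C"),
  gender      = c("female", "female", "female", "male", NA)
)

# Row counts, as plotted in the chart
row_counts <- toy %>%
  count(gender)

# Unique laureates: drop repeated (name, gender) pairs before counting
laureate_counts <- toy %>%
  distinct(fullName.en, gender) %>%
  count(gender)

row_counts
laureate_counts
```

Applied to `laureates`, the same `distinct()` step would give per-person counts rather than per-row counts.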

Question 2: Which institutions have the most laureates?

laureates %>%
  filter(!is.na(nobelPrizes_affiliations_name.en)) %>%
  count(nobelPrizes_affiliations_name.en, sort = TRUE) %>%
  slice_max(n, n=10) %>%
  ggplot(aes(reorder(nobelPrizes_affiliations_name.en,n),y=n)) +
  geom_col(width=0.5, fill = "lightblue") +
  coord_flip() +
  labs(
    title = "Top 10 Institutions",
    x = "Institutions",
    y = "Count"
  )

Among the top 10 institutions, only the University of Cambridge is in the UK. All the others are in the United States.

Question 3: How has the prize amount changed over time?

laureates %>%
  group_by(nobelPrizes_awardYear) %>%
  summarise(avg_prize = mean(nobelPrizes_prizeAmount, na.rm=TRUE)) %>%
  ggplot(aes(x = nobelPrizes_awardYear, y = avg_prize/1e6, group = 1)) +
  geom_line() +
  geom_point(color = "red") +
  labs(
    title = "Average Nobel Prize Amount Over Time",
    x = "Year",
    y = "Prize amount (millions of SEK)"
  ) +
  theme(axis.text.x = element_blank())

The line chart shows the nominal prize amount (prizeAmount) rather than the inflation-adjusted prizeAmountAdjusted. The x-axis labels are hidden, but because the data span 1901 to 2025, we can estimate that the prize amount starts to rise sharply around the 1970s and 1980s.
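
The year at which the increase begins could be read directly from the chart if the x-axis were numeric. The sketch below assumes that `nobelPrizes_awardYear` is parsed as a character column, which the `group = 1` in the plotting code suggests. The rows and amounts are made up for illustration, and `award_year` is a hypothetical helper column.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only; the amounts are not real prize data
toy <- tibble(
  nobelPrizes_awardYear   = c("1901", "1901", "1950", "2000"),
  nobelPrizes_prizeAmount = c(100, 100, 200, 900)
)

# Convert the year to integer before summarising so the axis is continuous
yearly <- toy %>%
  mutate(award_year = as.integer(nobelPrizes_awardYear)) %>%
  group_by(award_year) %>%
  summarise(avg_prize = mean(nobelPrizes_prizeAmount, na.rm = TRUE))

# With a numeric x, group = 1 is unnecessary and the year labels can be shown
p <- ggplot(yearly, aes(x = award_year, y = avg_prize)) +
  geom_line() +
  geom_point(color = "red") +
  scale_x_continuous(breaks = seq(1900, 2020, by = 20)) +
  labs(x = "Year", y = "Average prize amount")
```

With labelled breaks every 20 years, the decade in which the curve steepens no longer has to be guessed from the overall time span.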

Question 4: Do male and female laureates receive different average prize amounts?

Hypothesis test:

Null hypothesis (H0): Male and female laureates received the same average prize amount.

Alternative hypothesis (Ha): The average prize amounts were different.

# keep gender and prize amount, dropping rows with missing values (NA)
gender_df <- laureates %>%
  filter(!is.na(gender), !is.na(nobelPrizes_prizeAmount)) %>%
  select(gender, nobelPrizes_prizeAmount)

# statistics summary
gender_df %>%
  group_by(gender) %>%
  summarise(
    count = n(),
    avg_prize = mean(nobelPrizes_prizeAmount),
    sd_prize = sd(nobelPrizes_prizeAmount)
  )
## # A tibble: 2 × 4
##   gender count avg_prize sd_prize
##   <chr>  <int>     <dbl>    <dbl>
## 1 female    70  6506586. 4362382.
## 2 male     996  4038864. 4304569.
# t-test
t.test(nobelPrizes_prizeAmount ~ gender, data = gender_df)
## 
##  Welch Two Sample t-test
## 
## data:  nobelPrizes_prizeAmount by gender
## t = 4.5788, df = 78.741, p-value = 1.724e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  1394916 3540528
## sample estimates:
## mean in group female   mean in group male 
##              6506586              4038864

The p-value is 1.724e-05, far below 0.05, so the null hypothesis is rejected: the difference in mean prize amount between female and male laureates is statistically significant.

We are also 95% confident that the true difference in mean prize amount (female minus male) lies between SEK 1,394,916 and SEK 3,540,528. The interval excludes 0, which is consistent with rejecting the null hypothesis.
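
As a cross-check, the Welch test can be reproduced by hand from the summary table. The sketch below uses only the printed (rounded) counts, means, and standard deviations, so small rounding differences from the `t.test()` output are expected. Variable names such as `v_f` and `df_w` are introduced here for illustration.

```r
# Summary statistics copied from the table above (rounded as printed)
n_f <- 70;  mean_f <- 6506586; sd_f <- 4362382
n_m <- 996; mean_m <- 4038864; sd_m <- 4304569

# Squared standard error contributed by each group
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se  <- sqrt(v_f + v_m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean_f - mean_m) / se
df_w   <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for (female - male)
p_value <- 2 * pt(-abs(t_stat), df = df_w)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df = df_w) * se

round(c(t = t_stat, df = df_w), 3)
p_value
round(ci)
```

The resulting statistic, degrees of freedom, and interval should agree with the reported t = 4.5788, df = 78.741, and confidence interval up to rounding. The calculation also makes visible that the smaller female group (n = 70) contributes most of the standard error.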

Conclusion

The project demonstrates how API data can be imported, structured, and tidied in R to answer real-world questions. The analysis found a large gap between the numbers of male and female laureates; 9 of the top 10 institutions with the most laureates are located in the United States; prize amounts increased at a faster pace after the 1960s; and the hypothesis test indicates a statistically significant difference between the average prize amounts received by male and female laureates.

Overall, this project highlights how R is used to explore real-world questions through visualization and statistical tools.