Task: Use the Nobel Prize API and demonstrate rectangling. This version adds robust coercion for
birthDate,birthCtry, andaffilCtryto prevent list/NULL type issues during filtering and joins.
limit <- 1500
laur_raw <- jsonlite::read_json(
paste0("https://api.nobelprize.org/2.1/laureates?limit=", limit),
simplifyVector = FALSE
)
laureates <- tibble(raw = laur_raw$laureates)
# helper to coerce possible list/NULL/scalar into character
as_chr_scalar <- function(x) {
if (is.null(x) || length(x) == 0) return(NA_character_)
if (is.list(x)) return(as.character(x[[1]]))
as.character(x)
}
people <- laureates |>
hoist(raw,
id = "id",
known_en = list("knownName","en"),
gender = "gender",
birthDate = "birth","date",
birthCtry = "birth","place","country","en",
prizes = "nobelPrizes"
) |>
filter(!is.na(gender)) |>
mutate(
# coerce birthDate to Date safely
birthDate = purrr::map_chr(birthDate, as_chr_scalar),
birthDate = lubridate::ymd(birthDate, quiet = TRUE),
# coerce birthCtry to character safely
birthCtry = purrr::map_chr(birthCtry, as_chr_scalar)
)
Expand prizes and affiliations, coercing affilCtry to
character:
pr_aff <- people |>
unnest_longer(prizes, keep_empty = TRUE) |>
hoist(prizes,
awardYear = "awardYear",
category = list("category","en"),
affils = "affiliations"
) |>
mutate(
awardYear = suppressWarnings(as.integer(awardYear))
) |>
unnest_longer(affils, keep_empty = TRUE) |>
hoist(affils,
orgName = list("name","en"),
affilCtry = "location","country","en"
) |>
mutate(
affilCtry = purrr::map_chr(affilCtry, as_chr_scalar)
) |>
select(id, known_en, gender, birthDate, birthCtry, awardYear, category, orgName, affilCtry)
lost_counts <- pr_aff |>
filter(!is.na(birthCtry), !is.na(affilCtry), birthCtry != affilCtry) |>
count(birthCtry, sort = TRUE)
knitr::kable(head(lost_counts, 12),
caption = "Top birth countries where laureates were affiliated elsewhere when awarded (sample)")
| birthCtry | n |
|---|
Answer (Q1): The table lists the birth countries with the highest counts of “lost” laureates in this sample (`limit = 1500). Interpretation (Q1). A small number of birth countries account for most “lost” laureates—born in one country but affiliated elsewhere at award time—likely reflecting historic migration and research mobility. Results may shift slightly with the full API beyond the sampled limit.
age_df <- people |>
unnest_longer(prizes, keep_empty = TRUE) |>
hoist(prizes, awardYear = "awardYear", category = list("category","en")) |>
mutate(
awardYear = suppressWarnings(as.integer(awardYear)),
age_at_award = if_else(!is.na(birthDate) & !is.na(awardYear),
awardYear - lubridate::year(birthDate), NA_integer_)
) |>
filter(!is.na(age_at_award), age_at_award > 0)
age_trend <- age_df |>
mutate(decade = (awardYear %/% 10) * 10) |>
group_by(category, decade) |>
summarise(median_age = median(age_at_award), .groups = "drop")
knitr::kable(head(age_trend |> arrange(category, decade), 20),
caption = "Median age at award by decade and category (sample)")
| category | decade | median_age |
|---|---|---|
| Chemistry | 1900 | 50.0 |
| Chemistry | 1910 | 46.5 |
| Chemistry | 1920 | 53.0 |
| Chemistry | 1930 | 48.0 |
| Chemistry | 1940 | 55.0 |
| Chemistry | 1950 | 51.5 |
| Chemistry | 1960 | 52.0 |
| Chemistry | 1970 | 60.0 |
| Chemistry | 1980 | 54.0 |
| Chemistry | 1990 | 62.0 |
| Chemistry | 2000 | 64.0 |
| Chemistry | 2010 | 72.0 |
| Chemistry | 2020 | 56.0 |
| Economic Sciences | 1960 | 70.0 |
| Economic Sciences | 1970 | 67.0 |
| Economic Sciences | 1980 | 67.0 |
| Economic Sciences | 1990 | 65.0 |
| Economic Sciences | 2000 | 64.0 |
| Economic Sciences | 2010 | 68.0 |
| Economic Sciences | 2020 | 69.0 |
ggplot(age_trend, aes(decade, median_age, group = category)) +
geom_line() + geom_point() +
facet_wrap(~ category, scales = "free_y") +
labs(title = "Median age at award by decade",
x = "Decade", y = "Median age") +
theme_minimal(base_size = 11)
Answer (Q2): Median age generally trends upward over time for several categories; see the per-category facets above. Interpretation (Q2). Median age at award trends upward across the century in multiple fields, suggesting longer training/career arcs before prize-winning contributions are recognized. The slope varies by category, with some areas leveling more recently.
inst_counts <- pr_aff |>
filter(!is.na(orgName)) |>
count(orgName, affilCtry, sort = TRUE)
knitr::kable(head(inst_counts, 15),
caption = "Most frequent affiliated institutions (with country) in the sample")
| orgName | affilCtry | n |
|---|---|---|
| University of California | NA | 43 |
| Harvard University | NA | 29 |
| Massachusetts Institute of Technology (MIT) | NA | 25 |
| Stanford University | NA | 22 |
| California Institute of Technology (Caltech) | NA | 20 |
| University of Chicago | NA | 20 |
| Columbia University | NA | 18 |
| Princeton University | NA | 18 |
| University of Cambridge | NA | 18 |
| Howard Hughes Medical Institute | NA | 16 |
| Rockefeller University | NA | 13 |
| MRC Laboratory of Molecular Biology | NA | 10 |
| University of Oxford | NA | 10 |
| Yale University | NA | 9 |
| Cornell University | NA | 8 |
Answer (Q3): These institutions recur most often in the affiliation data associated with awards (sample-limited). Interpretation (Q3). A handful of institutions appear repeatedly across awards, indicating concentration of research capacity. Country labels show geographic clusters (e.g., U.S./U.K./Europe) that align with known funding and collaboration hubs.