NobelPrize.org

Author

ZIHAO YU

1. How will I tackle the problem?

The four questions are about 1). gender in the proportion, 2). age(age gap, Average age, etc), 3). Which Nobel Prize category has the largest number of laureates, and 4). Which laureates were born in one country but awarded as citizens of another?

Retrieve Nobel Prize data via an API. After converting the raw data and completing data cleaning and organization, conduct exploratory data analysis to answer four specific questions.

2. What data challenges do I anticipate?

The data covers a long period, and the completeness of records may vary across different time periods; therefore, some variables may have missing values or incomplete information.

Countries may have undergone name changes, divisions, or mergers over time, which can affect further organization and comparative analysis.

3. Load the data to work with APIs

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(httr2)
library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

req1 <- request("https://api.nobelprize.org/2.1/laureates?limit=1000")

resp1 <- req_perform(req1)

raw_data1 <-
  resp_body_string(resp1) |>
  fromJSON(flatten = TRUE)

names(raw_data1)

[1] "laureates" "meta"      "links"

req2 <- request("https://api.nobelprize.org/2.1/nobelPrizes?limit=1000")

resp2 <- req_perform(req2)

raw_data2 <- 
  resp_body_string(resp2) |>
  fromJSON(flatten = TRUE)

names(raw_data2)

[1] "nobelPrizes" "meta"        "links"

4. Extract the data and convert it to a tibble

laureates_df <- as_tibble(raw_data1$laureates)

prizes_df <- as_tibble(raw_data2$nobelPrizes)

names(laureates_df)

  [1] "id"                               "fileName"                        
  [3] "gender"                           "sameAs"                          
  [5] "links"                            "nobelPrizes"                     
  [7] "acronym"                          "nativeName"                      
  [9] "penName"                          "knownName.en"                    
 [11] "knownName.se"                     "knownName.no"                    
 [13] "givenName.en"                     "givenName.se"                    
 [15] "givenName.no"                     "familyName.en"                   
 [17] "familyName.se"                    "familyName.no"                   
 [19] "fullName.en"                      "fullName.se"                     
 [21] "fullName.no"                      "birth.date"                      
 [23] "birth.year"                       "birth.place.city.en"             
 [25] "birth.place.city.no"              "birth.place.city.se"             
 [27] "birth.place.country.en"           "birth.place.country.no"          
 [29] "birth.place.country.se"           "birth.place.cityNow.en"          
 [31] "birth.place.cityNow.no"           "birth.place.cityNow.se"          
 [33] "birth.place.cityNow.sameAs"       "birth.place.cityNow.latitude"    
 [35] "birth.place.cityNow.longitude"    "birth.place.countryNow.en"       
 [37] "birth.place.countryNow.no"        "birth.place.countryNow.se"       
 [39] "birth.place.countryNow.sameAs"    "birth.place.countryNow.latitude" 
 [41] "birth.place.countryNow.longitude" "birth.place.continent.en"        
 [43] "birth.place.continent.no"         "birth.place.continent.se"        
 [45] "birth.place.locationString.en"    "birth.place.locationString.no"   
 [47] "birth.place.locationString.se"    "wikipedia.slug"                  
 [49] "wikipedia.english"                "wikidata.id"                     
 [51] "wikidata.url"                     "death.date"                      
 [53] "death.place.city.en"              "death.place.city.no"             
 [55] "death.place.city.se"              "death.place.country.en"          
 [57] "death.place.country.no"           "death.place.country.se"          
 [59] "death.place.country.sameAs"       "death.place.cityNow.en"          
 [61] "death.place.cityNow.no"           "death.place.cityNow.se"          
 [63] "death.place.cityNow.sameAs"       "death.place.cityNow.latitude"    
 [65] "death.place.cityNow.longitude"    "death.place.countryNow.en"       
 [67] "death.place.countryNow.no"        "death.place.countryNow.se"       
 [69] "death.place.countryNow.sameAs"    "death.place.countryNow.latitude" 
 [71] "death.place.countryNow.longitude" "death.place.continent.en"        
 [73] "death.place.continent.no"         "death.place.continent.se"        
 [75] "death.place.locationString.en"    "death.place.locationString.no"   
 [77] "death.place.locationString.se"    "orgName.en"                      
 [79] "orgName.no"                       "orgName.se"                      
 [81] "founded.date"                     "founded.place.city.en"           
 [83] "founded.place.city.no"            "founded.place.city.se"           
 [85] "founded.place.country.en"         "founded.place.country.no"        
 [87] "founded.place.country.se"         "founded.place.country.sameAs"    
 [89] "founded.place.cityNow.en"         "founded.place.cityNow.no"        
 [91] "founded.place.cityNow.se"         "founded.place.cityNow.sameAs"    
 [93] "founded.place.countryNow.en"      "founded.place.countryNow.no"     
 [95] "founded.place.countryNow.se"      "founded.place.countryNow.sameAs" 
 [97] "founded.place.continent.en"       "founded.place.continent.no"      
 [99] "founded.place.continent.se"       "founded.place.locationString.en" 
[101] "founded.place.locationString.no"  "founded.place.locationString.se" 
[103] "penNameOf.fullName"               "foundedCountry.en"               
[105] "foundedCountry.no"                "foundedCountry.se"               
[107] "foundedCountryNow.en"             "foundedCountryNow.no"            
[109] "foundedCountryNow.se"             "foundedContinent.en"

names(prizes_df)

 [1] "awardYear"           "dateAwarded"         "prizeAmount"        
 [4] "prizeAmountAdjusted" "links"               "laureates"          
 [7] "category.en"         "category.no"         "category.se"        
[10] "categoryFullName.en" "categoryFullName.no" "categoryFullName.se"
[13] "topMotivation.en"    "topMotivation.se"

prizes_laureates <- 
  prizes_df |>
  unnest(laureates, names_sep = "_", keep_empty = TRUE)

names(prizes_laureates)

 [1] "awardYear"               "dateAwarded"            
 [3] "prizeAmount"             "prizeAmountAdjusted"    
 [5] "links"                   "laureates_id"           
 [7] "laureates_portion"       "laureates_sortOrder"    
 [9] "laureates_links"         "laureates_knownName.en" 
[11] "laureates_fullName.en"   "laureates_motivation.en"
[13] "laureates_motivation.se" "laureates_motivation.no"
[15] "laureates_nativeName"    "laureates_orgName.en"   
[17] "laureates_orgName.no"    "laureates_knownName.no" 
[19] "laureates_acronym"       "category.en"            
[21] "category.no"             "category.se"            
[23] "categoryFullName.en"     "categoryFullName.no"    
[25] "categoryFullName.se"     "topMotivation.en"       
[27] "topMotivation.se"

5. Data cleaning

dealing with the data1

nobel_tidy <- 
  laureates_df |>
  unnest(nobelPrizes, names_sep = "_") |>
  mutate(
    award_year = as.integer(nobelPrizes_awardYear),
    category = nobelPrizes_category.en,
    laureate_name = coalesce(knownName.en, fullName.en),
    birth_year = as.integer(str_sub(birth.date, 1, 4)),
    award_year = as.integer(nobelPrizes_awardYear),
    birth_country = birth.place.country.en,
    birth_country_now = birth.place.countryNow.en,
    birth_continent = birth.place.continent.en,
    motivation = nobelPrizes_motivation.en
) |>
  select(
    id,
    laureate_name,
    gender,
    birth_year,
    award_year,
    birth_country,
    birth_country_now,
    birth_continent,
    award_year,
    category,
    motivation
)

head(nobel_tidy)

# A tibble: 6 × 10
  id    laureate_name     gender birth_year award_year birth_country            
  <chr> <chr>             <chr>       <int>      <int> <chr>                    
1 745   A. Michael Spence male         1943       2001 USA                      
2 102   Aage N. Bohr      male         1922       1975 Denmark                  
3 779   Aaron Ciechanover male         1947       2004 British Protectorate of …
4 259   Aaron Klug        male         1926       1982 Lithuania                
5 1004  Abdulrazak Gurnah male         1948       2021 <NA>                     
6 114   Abdus Salam       male         1926       1979 India                    
# ℹ 4 more variables: birth_country_now <chr>, birth_continent <chr>,
#   category <chr>, motivation <chr>

dealing with the data2

prizes_tidy <-
  prizes_laureates |>
  mutate(
    award_year = as.integer(awardYear),
    category = category.en,
    laureate_name = coalesce(
      laureates_knownName.en,
      laureates_fullName.en,
      laureates_orgName.en
    )
  ) |>
  select(
    award_year,
    category,
    id = laureates_id,
    laureate_name,
    motivation = laureates_motivation.en,
    prize_amount = prizeAmount,
    prize_amount_adjusted = prizeAmountAdjusted
  )

head(prizes_tidy)

# A tibble: 6 × 7
  award_year category               id    laureate_name  motivation prize_amount
       <int> <chr>                  <chr> <chr>          <chr>             <int>
1       1901 Chemistry              160   Jacobus H. va… in recogn…       150782
2       1901 Literature             569   Sully Prudhom… in specia…       150782
3       1901 Peace                  462   Henry Dunant   for his h…       150782
4       1901 Peace                  463   Frédéric Passy for his l…       150782
5       1901 Physics                1     Wilhelm Conra… in recogn…       150782
6       1901 Physiology or Medicine 293   Emil von Behr… for his w…       150782
# ℹ 1 more variable: prize_amount_adjusted <int>

6. Formulate four questions

gender_count <-
  nobel_tidy |>
  filter(!is.na(gender)) |>
  count(gender, sort = TRUE)

gender_count |>
  ggplot(aes(
    x = gender, 
    y = n, 
    fill = gender)
) +
  geom_col(width = 0.5) +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(
    title = "Number of Nobel Prize Laureates by Gender",
    x = "Gender",
    y = "Number of Laureates"
) +
  theme_minimal()

1. How are Nobel Prize laureates distributed by gender?

The results show that male laureates appear much more frequently than female laureates in the Nobel Prize data.

age_info <- nobel_tidy |>
  mutate(
    age_at_award = award_year - birth_year
) |>
  filter(!is.na(age_at_award))

youngest_oldest <-
  age_info |>
  filter(
    age_at_award == min(age_at_award) |
    age_at_award == max(age_at_award)
) |>
  select(
    laureate_name,
    award_year,
    category,
    age_at_award
) |>
  arrange(age_at_award)

youngest_oldest

# A tibble: 2 × 4
  laureate_name      award_year category  age_at_award
  <chr>                   <int> <chr>            <int>
1 Malala Yousafzai         2014 Peace               17
2 John B. Goodenough       2019 Chemistry           97

# Average age by decade
average_age_decade <- 
  age_info |>
  mutate(
    decade = floor(award_year / 10) * 10
) |>
  group_by(decade) |>
  summarise(
    average_age = mean(age_at_award),
    .groups = "drop"
)

ggplot(
  average_age_decade, 
  aes(x = factor(decade), 
      y = average_age,
      fill = decade)
) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(average_age, 1)), vjust = -0.3) +
  labs(
    title = "Average Age of Nobel Prize Laureates by Decade",
    x = "Decade",
    y = "Average Age at Award"
  ) +
  theme_minimal()

2. What are the youngest and oldest ages at which Nobel laureates received their prizes, and how has the average age at award changed by decade?

The youngest laureate in this data is Malala Yousafzai, who received the Peace prize in 2014 at age 17. The oldest laureate is John B. Goodenough, who received the Chemistry prize in 2019 at age 97.

The bar chart shows that the average age at award generally increased over time. Earlier decades were mostly in the 50s, while more recent decades are mostly in the 60s, with the highest average age appearing in the 2020s at about 68.5.

category_count <- 
  nobel_tidy |>
  filter(!is.na(category)) |>
  group_by(category) |>
  summarise(
    number_of_laureates = n_distinct(id),
    .groups = "drop"
) |>
  arrange(desc(number_of_laureates))

ggplot(
  category_count, 
  aes(x = reorder(category, number_of_laureates), 
      y = number_of_laureates,
      fill = category)
) +
  geom_col(width = 0.7) +
  geom_text(aes(label = number_of_laureates), hjust = -0.3) +
  coord_flip() +
  labs(
    title = "Number of Nobel Prize Laureates by Category",
    x = "Category",
    y = "Number of Laureates"
  ) +
  theme_minimal()

3. Which Nobel Prize category has the most laureates?

The category with the most laureates is Physiology or Medicine, with 231 laureates. The next highest categories are Physics with 224 laureates and Chemistry with 196 laureates. The category with the fewest laureates is Economic Sciences, with 99 laureates.

affil_info <-
  laureates_df |>
  unnest(nobelPrizes, names_sep = "_") |>
  unnest(nobelPrizes_affiliations, names_sep = "_", keep_empty = TRUE) |>
  transmute(
    id,
    award_year = as.integer(nobelPrizes_awardYear),
    category = nobelPrizes_category.en,
    affil_country = nobelPrizes_affiliations_country.en
) |>
  filter(!is.na(affil_country))

affil_df <-
  nobel_tidy |>
  left_join(affil_info, by = c("id", "award_year", "category")) |>
  filter(
    !is.na(birth_country),
    !is.na(affil_country),
    birth_country != affil_country
) |>
  select(
    laureate_name,
    birth_country,
    affil_country,
    award_year,
    category
) |>
  arrange(birth_country)

country_pairs <-
  affil_df |>
  count(birth_country, affil_country, sort = TRUE
) |>
  mutate(
    country_pair = paste(birth_country, "to", affil_country)
) |>
  slice_head(n = 10)

ggplot(country_pairs,
       aes(x = reorder(country_pair, n), y = n, fill = n)
) +
  geom_col(width = 0.6) +
  geom_text(aes(label = n), hjust = -0.2) +
  coord_flip() +
  labs(
    title = "Top 10 Birth Country and Affiliation Country Pairs",
    x = "Birth Country to Affiliation Country",
    y = "Number of Laureates"
  ) +
  theme_minimal()

4. Which birth-country and affiliation-country pairs appear most often among Nobel laureates?

The most common pairs are United Kingdom to USA and Germany to USA, both with 21 laureates.