Week10B

Author

Sinem K Moschos

Approach

For this assignment, I will use the Nobel Prize API to work with JSON data and answer four questions using Nobel Prize information.

The main goal of this assignment is to practice working with JSON from an API, then turning that data into tidy data frames that can be explored in R.

Retrieve the JSON data from the Nobel Prize API

First, I will connect to the Nobel Prize API directly from R instead of downloading a file. I plan to use one or both of these endpoints:

  • Nobel Prize data
  • Laureate data

Load the JSON data into R

Next, I will use R packages for JSON and tidy data work. My plan is to load the API response into R, inspect the structure and identify which parts of the JSON need to be extracted.

Because API JSON usually comes in a nested format, I expect that some fields will need extra cleaning or unnesting before analysis.
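
As a minimal sketch of this inspection step, the example below parses a small hand-written JSON string (a stand-in, not real API output) and lists the columns that are still nested after loading. The sample records and the `list_cols` name are assumptions made for illustration only.

```r
library(jsonlite)

# Hand-written sample that mimics a nested API response (not real API output)
sample_json <- '{
  "laureates": [
    {"id": "1", "fullName": {"en": "Person A"},
     "nobelPrizes": [{"awardYear": "1950", "category": {"en": "Physics"}}]},
    {"id": "2", "orgName": {"en": "Organization B"},
     "nobelPrizes": [{"awardYear": "1960", "category": {"en": "Peace"}},
                     {"awardYear": "1970", "category": {"en": "Peace"}}]}
  ]
}'

parsed <- fromJSON(sample_json, flatten = TRUE)

# Top-level names show which element holds the records
names(parsed)

# Limit the depth so the printout stays readable
str(parsed$laureates, max.level = 1)

# Columns that are still lists will need unnesting before analysis
list_cols <- names(parsed$laureates)[sapply(parsed$laureates, is.list)]
list_cols
```

Any column reported in `list_cols` is a candidate for `unnest()` in the transformation step.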

Transform the JSON into tidy data frames

After loading the JSON data, I will convert the important parts into tidy data frames. This way, each variable has its own column and each observation has its own row.

For example:

  • select useful fields
  • unnest nested columns
  • clean text fields
  • separate prize-level and laureate-level information
  • join data frames when needed

The goal is not only to retrieve JSON but also to transform it into tidy data frames.
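
The steps listed above can be sketched on a toy data frame. The data here is made up for illustration, and the column names `prizes` and `name` are assumptions rather than fields from the API.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per laureate, with prizes stored in a list-column
nested <- tibble(
  id = c("1", "2"),
  name = c(" Person A", "Organization B "),
  prizes = list(
    tibble(awardYear = "1950", category = "Physics"),
    tibble(awardYear = c("1960", "1970"), category = c("Peace", "Peace"))
  )
)

tidy <- nested %>%
  unnest(prizes) %>%                        # one row per laureate-prize pair
  transmute(
    laureate_id = id,
    known_name  = trimws(name),             # clean stray whitespace in text
    award_year  = as.integer(awardYear),    # store the year as a number
    category    = category
  )

tidy
```

After unnesting, the laureate with two prizes occupies two rows, which is the one-observation-per-row shape that the later counts rely on.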

Create four data-driven questions

After the data is cleaned, I will create four questions that can be answered from the Nobel Prize data.

One question will go beyond a basic count and will require filtering, joining, or comparing multiple fields.

Answer each question with code and results

For each of the four questions, I will include:

  • the question itself
  • the R code used to answer it
  • the result, shown as a table, summary, or visualization

Data Source

Nobel Prize API:

  • https://api.nobelprize.org/2.1/nobelPrizes
  • https://api.nobelprize.org/2.1/laureates

Code Base

In this section, I retrieve JSON data directly from the Nobel Prize API, turn it into tidy data frames, and answer four questions from the data. According to the Nobel Prize Developer Zone, the API returns JSON and includes endpoints such as nobelPrizes and laureates.

library(jsonlite)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

Attaching package: 'purrr'
The following object is masked from 'package:jsonlite':

    flatten
library(stringr)
library(ggplot2)

I define the two API links and read the JSON data into R.

laureates_url <- "https://api.nobelprize.org/2.1/laureates?limit=1000"
prizes_url <- "https://api.nobelprize.org/2.1/nobelPrizes?limit=1000"

laureates_json <- fromJSON(laureates_url, flatten = TRUE)
prizes_json <- fromJSON(prizes_url, flatten = TRUE)
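
Because the requests rely on limit=1000 to fetch everything in one call, it may be worth confirming that no records were cut off. The sketch below uses a hypothetical helper, `is_complete()`, and assumes the response carries a total under `meta$count`; that field name is an assumption and should be checked against the live response. The mock objects are not real API output.

```r
# Hypothetical helper: TRUE when the rows returned match the reported total
is_complete <- function(resp, key) {
  returned <- nrow(resp[[key]])
  reported <- resp$meta$count   # assumed field name; verify against the real response
  !is.null(reported) && returned == reported
}

# Mock responses for illustration only
mock_full <- list(laureates = data.frame(id = c("1", "2", "3")),
                  meta = list(count = 3))
mock_part <- list(laureates = data.frame(id = c("1", "2")),
                  meta = list(count = 3))

is_complete(mock_full, "laureates")
is_complete(mock_part, "laureates")
```

If the same check were applied to laureates_json and returned FALSE, the limit would need to be raised or the request repeated with an offset.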

I convert the JSON objects into data frames.

laureates_raw <- laureates_json$laureates
prizes_raw <- prizes_json$nobelPrizes

I create a tidy laureates table. The name column falls back to the organization name when the laureate is not a person.

laureates_df <- laureates_raw %>%
  transmute(
    laureate_id = id,
    known_name = coalesce(fullName.en, orgName.en),
    gender = gender,
    birth_country = birth.place.country.en
  )
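
To show what coalesce() does in the table above, here is a small sketch on made-up rows: it returns the first non-missing value across its arguments, so a person keeps the full name and an organization gets the organization name.

```r
library(dplyr)

# Made-up rows: a person has fullName.en, an organization has orgName.en
toy <- tibble(
  fullName.en = c("Person A", NA),
  orgName.en  = c(NA, "Organization B")
)

named <- toy %>%
  mutate(known_name = coalesce(fullName.en, orgName.en))

named
```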

I create another table that expands the prizes each laureate received. Here I used an LLM for help getting the functions to run correctly.

laureate_prizes_df <- laureates_raw %>%
  select(id, fullName.en, orgName.en, gender, nobelPrizes) %>%
  unnest(nobelPrizes) %>%
  transmute(
    laureate_id = id,
    known_name = coalesce(fullName.en, orgName.en),
    gender = gender,
    award_year = as.integer(awardYear),
    category = category.en
  )

I also create a prize-level table from the nobelPrizes endpoint.

prizes_df <- prizes_raw %>%
  transmute(
    award_year = as.integer(awardYear),
    category = category.en
  )

Question 1: Which Nobel Prize categories have been awarded the most times?

This question looks at how often each category appears in the prize data.

q1_table <- prizes_df %>%
  count(category, sort = TRUE)

ggplot(q1_table, aes(x = reorder(category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Number of Nobel Prizes by Category",
    x = "Category",
    y = "Number of Awards"
  )

Question 2: Which countries have produced the most Nobel laureates by place of birth?

This question uses the laureates table and counts birth countries.

q2_table <- laureates_df %>%
  filter(!is.na(birth_country), birth_country != "") %>%
  count(birth_country, sort = TRUE) %>%
  slice_head(n = 10)

q2_table
     birth_country   n
1              USA 295
2   United Kingdom  93
3          Germany  78
4           France  60
5           Sweden  30
6            Japan  27
7           Canada  22
8  the Netherlands  20
9      Switzerland  19
10           Italy  18

Question 3: Which laureates have won more than one Nobel Prize?

This question goes beyond counting categories because it groups by laureate and compares how many prizes each person or organization received. The table lists laureates who have received more than one Nobel Prize.

q3_table <- laureate_prizes_df %>%
  count(known_name, sort = TRUE) %>%
  filter(n > 1)

q3_table
# A tibble: 7 × 2
  known_name                                                      n
  <chr>                                                       <int>
1 International Committee of the Red Cross                        3
2 Frederick Sanger                                                2
3 John Bardeen                                                    2
4 K. Barry Sharpless                                              2
5 Linus Carl Pauling                                              2
6 Marie Curie, née Skłodowska                                     2
7 Office of the United Nations High Commissioner for Refugees     2
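
Counting by name works here, but it would merge two different laureates who happen to share a name. A hedged alternative, sketched on made-up rows, is to include the laureate ID in the count. The toy table and its values are assumptions for illustration.

```r
library(dplyr)

# Made-up rows: IDs 2 and 3 are different laureates with the same name
toy_prizes <- tibble(
  laureate_id = c("1", "1", "2", "3"),
  known_name  = c("Person A", "Person A", "Person B", "Person B")
)

# Counting by name alone treats IDs 2 and 3 as one repeat winner
by_name <- toy_prizes %>%
  count(known_name, sort = TRUE) %>%
  filter(n > 1)

# Counting by ID and name keeps them apart
by_id <- toy_prizes %>%
  count(laureate_id, known_name, sort = TRUE) %>%
  filter(n > 1)

by_name
by_id
```

Only the laureate with ID 1 remains in `by_id`, while `by_name` also reports the shared name as a repeat winner.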

Question 4: Which country lost the most Nobel laureates, meaning they were born there but awarded while affiliated with an organization in another country?

This is the more advanced question for the assignment because it requires joining, unnesting, filtering, and comparing fields across the data. I first expand the affiliation information from the laureates' prize records.

affiliation_df <- laureates_raw %>%
  select(id, fullName.en, orgName.en, nobelPrizes) %>%
  unnest(nobelPrizes) %>%
  unnest(affiliations) %>%
  transmute(
    laureate_id = id,
    known_name = coalesce(fullName.en, orgName.en),
    award_year = as.integer(awardYear),
    category = category.en,
    affiliation_name = name.en,
    affiliation_city = city.en,
    affiliation_country = country.en
  )

affiliation_df
# A tibble: 839 × 7
   laureate_id known_name  award_year category affiliation_name affiliation_city
   <chr>       <chr>            <int> <chr>    <chr>            <chr>           
 1 745         A. Michael…       2001 Economi… Stanford Univer… Stanford, CA    
 2 102         Aage Niels…       1975 Physics  Niels Bohr Inst… Copenhagen      
 3 779         Aaron Ciec…       2004 Chemist… Technion - Isra… Haifa           
 4 259         Aaron Klug        1982 Chemist… MRC Laboratory … Cambridge       
 5 114         Abdus Salam       1979 Physics  International C… Trieste         
 6 114         Abdus Salam       1979 Physics  Imperial College London          
 7 982         Abhijit Ba…       2019 Economi… Massachusetts I… Cambridge, MA   
 8 843         Ada E. Yon…       2009 Chemist… Weizmann Instit… Rehovot         
 9 866         Adam G. Ri…       2011 Physics  Johns Hopkins U… Baltimore, MD   
10 866         Adam G. Ri…       2011 Physics  Space Telescope… Baltimore, MD   
# ℹ 829 more rows
# ℹ 1 more variable: affiliation_country <chr>

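Then I join birth country from the laureates table and compare it to the affiliation country. Because affiliation_df has one row per affiliation, a laureate with two affiliations outside the birth country is counted twice.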

q4_table <- affiliation_df %>%
  left_join(
    laureates_df %>% select(laureate_id, known_name, birth_country),
    by = c("laureate_id", "known_name")
  ) %>%
  filter(
    !is.na(birth_country), birth_country != "",
    !is.na(affiliation_country), affiliation_country != "",
    birth_country != affiliation_country
  ) %>%
  count(birth_country, sort = TRUE) %>%
  slice_head(n = 10)

q4_table
# A tibble: 10 × 2
   birth_country       n
   <chr>           <int>
 1 Germany            33
 2 United Kingdom     27
 3 France             16
 4 Canada             15
 5 Russia             14
 6 Austria-Hungary    12
 7 Prussia            12
 8 the Netherlands    11
 9 Russian Empire     10
10 Hungary             9
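
The counts above are per affiliation record. If the intent is to count each laureate once per birth country, one possible variation is to apply distinct() before count(). The sketch below uses made-up rows, so its numbers say nothing about the real data.

```r
library(dplyr)

# Made-up rows: laureate 1 has two affiliations abroad for the same prize
toy_aff <- tibble(
  laureate_id         = c("1", "1", "2", "3"),
  birth_country       = c("X", "X", "X", "Y"),
  affiliation_country = c("Z", "Z", "Z", "Y")
)

moved <- toy_aff %>%
  filter(birth_country != affiliation_country)

# One count per affiliation record
per_record <- moved %>%
  count(birth_country, sort = TRUE)

# One count per laureate
per_laureate <- moved %>%
  distinct(laureate_id, birth_country) %>%
  count(birth_country, sort = TRUE)

per_record
per_laureate
```

Which version is appropriate depends on whether the question is about people or about affiliation records.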

I also plot this comparison.

ggplot(q4_table, aes(x = reorder(birth_country, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Countries Losing Laureates to Other Award Affiliations",
    x = "Birth Country",
    y = "Number of Laureates Awarded Elsewhere"
  )

Conclusion

I retrieved JSON data directly from the Nobel Prize API's public endpoints, then transformed the nested JSON into tidy data frames using unnest(), transmute(), count(), and left_join(). The Nobel Prize Developer Zone states that the data is available through the laureates and nobelPrizes endpoints and is updated regularly.