Assignment 10B Approach

Author

Theresa Benny

Approach Deliverable

For this assignment, I will use the Nobel Prize public API to retrieve JSON data and transform it into tidy data frames in R. The Nobel Prize Developer Zone provides open data through a REST API, and the current API version is 2.1. The API returns Nobel Prize and laureate data in JSON or CSV format, which makes it appropriate for this assignment’s focus on JSON practice.

My first step will be to identify which Nobel API endpoint or endpoints are most useful for the questions I want to answer. Since the assignment requires four data-driven questions, I will likely work with prize-level data and laureate-level data so I can examine both award information and person-level details. After retrieving the JSON responses in R, I will parse the nested data and convert it into tidy tables. This will likely involve separating prize records from laureate records, expanding nested fields, and selecting the variables needed for analysis. Because JSON data often contains nested structures, one important part of the assignment will be reshaping the data into rectangular data frames that can be filtered, grouped, joined, and visualized.
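
As a minimal sketch of the tidying step described above, the example below parses a small hand-written JSON string and unnests its list-column so that each row is one laureate-prize pair. The JSON is a toy sample, not API output: its field names only imitate the kind of nesting I expect, and all values are placeholders.

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# Hand-written toy JSON; values are placeholders, not Nobel API data.
toy_json <- '[
  {"awardYear": "2001", "category": {"en": "Category A"},
   "laureates": [{"id": "a1", "portion": "1/2"},
                 {"id": "a2", "portion": "1/2"}]},
  {"awardYear": "2002", "category": {"en": "Category B"},
   "laureates": [{"id": "b1", "portion": "1"}]}
]'

# fromJSON() simplifies the array of records into a data frame;
# "category" becomes a nested data frame and "laureates" a list-column.
toy_prizes <- fromJSON(toy_json)

toy_tidy <- toy_prizes %>%
  mutate(category = category$en) %>%  # keep only the English label
  unnest(laureates)                   # one row per laureate

toy_tidy
```

The same two moves (pulling one field out of a nested data frame, then unnesting a list-column) are what the real prize data is expected to need.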

Once the data is loaded into tidy format, I will formulate four questions that can be answered directly from the Nobel data. At least one of these questions will go beyond a simple count and require a more advanced transformation, such as joining laureate information to prize information, comparing birth-country fields to award-affiliation or citizenship-related fields, or examining changes across time and categories. This matches the assignment requirement that at least one question involve joining, filtering, or comparing multiple fields rather than only summarizing totals.

A good strategy will be to choose four questions with increasing complexity. I will begin with one or two straightforward descriptive questions, such as which prize category has the most laureates or which years had the highest number of awardees. Then I will include more analytical questions, such as comparing the geographic origins of laureates with the countries connected to their prize affiliations, identifying trends by decade, or examining how the number of shared prizes has changed over time. This approach will show both basic JSON handling and stronger data-wrangling skills.

For each question, I will clearly structure the report in four parts: the question itself, the code used to retrieve and process the relevant data, the resulting table or plot, and a short interpretation of the answer. This structure will make it easy to demonstrate that each conclusion comes directly from the JSON data and that the full workflow is reproducible in Quarto.

One anticipated challenge is that Nobel API data is nested and may include repeated subfields for laureates, prize motivations, affiliations, or locations. Because of this, I will need to inspect the structure carefully before tidying it. Another challenge is that some variables may be missing for certain records or may appear differently across organizations and individuals, so I will need to handle incomplete fields carefully. A third challenge is making sure that the questions are interesting enough to go beyond simple counts while still being clearly answerable from the available JSON data.

The final deliverable will be a single Quarto file containing all four questions, all R code used to retrieve and tidy the Nobel Prize JSON data, and the resulting answers in the form of tables, summaries, or plots. This file will demonstrate the full workflow from API retrieval to tidy analysis and interpretation.

Codebase

#Load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)

Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten
library(httr2)
# tidyr is already attached by library(tidyverse), so it is not loaded separately
prizes_url <- "https://api.nobelprize.org/2.1/nobelPrizes"
laureates_url <- "https://api.nobelprize.org/2.1/laureates"

prizes_raw <- fromJSON(prizes_url)
laureates_raw <- fromJSON(laureates_url)
names(prizes_raw)
[1] "nobelPrizes" "meta"        "links"      
names(laureates_raw)
[1] "laureates" "meta"      "links"    
prizes <- prizes_raw$nobelPrizes
laureates <- laureates_raw$laureates

glimpse(prizes)
Rows: 25
Columns: 8
$ awardYear           <chr> "1901", "1901", "1901", "1901", "1901", "1902", "1…
$ category            <df[,3]> <data.frame[25 x 3]>
$ categoryFullName    <df[,3]> <data.frame[25 x 3]>
$ dateAwarded         <chr> "1901-11-12", "1901-11-14", "1901-12-10", "1901…
$ prizeAmount         <int> 150782, 150782, 150782, 150782, 150782, 141847,…
$ prizeAmountAdjusted <int> 10833458, 10833458, 10833458, 10833458, 10833458, …
$ links               <list> [<data.frame[1 x 4]>], [<data.frame[1 x 4]>], [<da…
$ laureates           <list> [<data.frame[1 x 7]>], [<data.frame[1 x 7]>], [<da…
glimpse(laureates)
Rows: 25
Columns: 14
$ id          <chr> "745", "102", "779", "259", "1004", "114", "982", "981", "…
$ knownName   <df[,2]> <data.frame[25 x 2]>
$ givenName   <df[,2]> <data.frame[25 x 2]>
$ familyName  <df[,2]> <data.frame[25 x 2]>
$ fullName    <df[,2]> <data.frame[25 x 2]>
$ fileName    <chr> "spence", "bohr", "ciechanover", "klug", "gurnah", "sal…
$ gender      <chr> "male", "male", "male", "male", "male", "male", "male",…
$ birth       <df[,3]> <data.frame[25 x 3]>
$ wikipedia   <df[,2]> <data.frame[25 x 2]>
$ wikidata    <df[,2]> <data.frame[25 x 2]>
$ sameAs      <list> <"https://www.wikidata.org/wiki/Q157245", "https://en.w…
$ links       <list> [<data.frame[2 x 6]>], [<data.frame[2 x 6]>], [<data.fr…
$ nobelPrizes <list> [<data.frame[1 x 12]>], [<data.frame[1 x 12]>], [<data.fra…
$ death       <df[,2]> <data.frame[25 x 2]>
names(prizes)
[1] "awardYear"           "category"            "categoryFullName"   
[4] "dateAwarded"         "prizeAmount"         "prizeAmountAdjusted"
[7] "links"               "laureates"          
#Tidy the prizes data
prizes_tidy <- prizes %>%
  select(awardYear, category, laureates) %>%
  unnest(laureates)
names(prizes_tidy)
 [1] "awardYear"  "category"   "id"         "knownName"  "fullName"  
 [6] "portion"    "sortOrder"  "motivation" "links"      "orgName"   
[11] "nativeName"
prizes_tidy$category[[1]]
 [1] "Chemistry"              "Literature"             "Peace"                 
 [4] "Peace"                  "Physics"                "Physiology or Medicine"
 [7] "Chemistry"              "Literature"             "Peace"                 
[10] "Peace"                  "Physics"                "Physics"               
[13] "Physiology or Medicine" "Chemistry"              "Literature"            
[16] "Peace"                  "Physics"                "Physics"               
[19] "Physics"                "Physiology or Medicine" "Chemistry"             
[22] "Literature"             "Literature"             "Peace"                 
[25] "Physics"                "Physiology or Medicine" "Chemistry"             
[28] "Literature"             "Peace"                  "Physics"               
[31] "Physiology or Medicine"
str(prizes_tidy$category[[1]])
 chr [1:31] "Chemistry" "Literature" "Peace" "Peace" "Physics" ...
#use simplified table
class(prizes_tidy$category)
[1] "data.frame"
str(prizes_tidy$category)
'data.frame':   31 obs. of  3 variables:
 $ en: chr  "Chemistry" "Literature" "Peace" "Peace" ...
 $ no: chr  "Kjemi" "Litteratur" "Fred" "Fred" ...
 $ se: chr  "Kemi" "Litteratur" "Fred" "Fred" ...
prizes_simple <- prizes_tidy %>%
  mutate(
    category = category$en
  )

head(prizes_simple[, c("awardYear", "category", "knownName")], 5)
# A tibble: 5 × 3
  awardYear category   knownName$en          
  <chr>     <chr>      <chr>                 
1 1901      Chemistry  Jacobus H. van 't Hoff
2 1901      Literature Sully Prudhomme       
3 1901      Peace      Henry Dunant          
4 1901      Peace      Frédéric Passy        
5 1901      Physics    Wilhelm Conrad Röntgen
class(prizes_simple$knownName)
[1] "data.frame"
str(prizes_simple$knownName)
'data.frame':   31 obs. of  1 variable:
 $ en: chr  "Jacobus H. van 't Hoff" "Sully Prudhomme" "Henry Dunant" "Frédéric Passy" ...
head(prizes_simple[, c("awardYear", "category", "knownName")], 10)
# A tibble: 10 × 3
   awardYear category               knownName$en          
   <chr>     <chr>                  <chr>                 
 1 1901      Chemistry              Jacobus H. van 't Hoff
 2 1901      Literature             Sully Prudhomme       
 3 1901      Peace                  Henry Dunant          
 4 1901      Peace                  Frédéric Passy        
 5 1901      Physics                Wilhelm Conrad Röntgen
 6 1901      Physiology or Medicine Emil von Behring      
 7 1902      Chemistry              Emil Fischer          
 8 1902      Literature             Theodor Mommsen       
 9 1902      Peace                  Élie Ducommun         
10 1902      Peace                  Albert Gobat          
#check what other columns are available in prizes_simple
names(prizes_simple)
 [1] "awardYear"  "category"   "id"         "knownName"  "fullName"  
 [6] "portion"    "sortOrder"  "motivation" "links"      "orgName"   
[11] "nativeName"

Question 1: Which Nobel Prize category has the most laureates?

#count laureates by category (each row is one laureate-prize pair)
category_counts <- prizes_simple %>%
  count(category, sort = TRUE)

category_counts
# A tibble: 5 × 2
  category                   n
  <chr>                  <int>
1 Physics                    8
2 Peace                      7
3 Literature                 6
4 Chemistry                  5
5 Physiology or Medicine     5
#create a plot
ggplot(category_counts, aes(x = reorder(category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Number of Nobel Laureates by Category",
    x = "Category",
    y = "Count"
  )

#how Nobel Prizes have been awarded over time
year_counts <- prizes_simple %>%
  count(awardYear)

head(year_counts)
# A tibble: 5 × 2
  awardYear     n
  <chr>     <int>
1 1901          6
2 1902          7
3 1903          7
4 1904          6
5 1905          5
ggplot(year_counts, aes(x = as.numeric(awardYear), y = n)) +
  geom_line() +
  labs(
    title = "Number of Nobel Laureates Over Time",
    x = "Year",
    y = "Number of Laureates"
  )

#find most recent years in the dataset

sort(unique(prizes_simple$awardYear), decreasing = TRUE)[1:10]
 [1] "1905" "1904" "1903" "1902" "1901" NA     NA     NA     NA     NA    

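The first request returned only 25 prize records, covering 1901 to 1905, so the results above reflect a small slice of the data. I am rerunning the request with limit=1000 so that the full set of prizes is retrieved.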

prizes_raw <- fromJSON("https://api.nobelprize.org/2.1/nobelPrizes?limit=1000")
prizes <- prizes_raw$nobelPrizes

prizes_tidy <- prizes %>%
  select(awardYear, category, laureates) %>%
  unnest(laureates)

nrow(prizes_tidy)
[1] 1026
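
Raising the limit works here because the full dataset fits in one response. As an alternative sketch, the request could be split into pages. This assumes the endpoint also accepts an offset query parameter and that the total number of records is known in advance (for example from the meta element of the first response); both are assumptions to check against the API documentation. The helper page_urls() and the total of 60 are hypothetical and used only for illustration.

```r
# Hypothetical helper: build one URL per page of results.
# Assumes the API supports "offset" alongside "limit".
page_urls <- function(base_url, total, limit = 25) {
  stopifnot(total > 0, limit > 0)
  offsets <- seq(0, total - 1, by = limit)  # 0, limit, 2 * limit, ...
  paste0(base_url, "?offset=", offsets, "&limit=", limit)
}

# Illustrative total only; the real value would come from the API response.
urls <- page_urls("https://api.nobelprize.org/2.1/nobelPrizes",
                  total = 60, limit = 25)
urls
```

Each URL could then be read with fromJSON() and the resulting pages combined into one table before tidying.
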
prizes_simple <- prizes_tidy %>%
  mutate(
    category = category$en
  )

head(prizes_simple[, c("awardYear", "category", "knownName")], 5)
# A tibble: 5 × 3
  awardYear category   knownName$en           $no  
  <chr>     <chr>      <chr>                  <chr>
1 1901      Chemistry  Jacobus H. van 't Hoff <NA> 
2 1901      Literature Sully Prudhomme        <NA> 
3 1901      Peace      Henry Dunant           <NA> 
4 1901      Peace      Frédéric Passy         <NA> 
5 1901      Physics    Wilhelm Conrad Röntgen <NA> 
class(prizes_simple$knownName)
[1] "data.frame"
str(prizes_simple$knownName)
'data.frame':   1026 obs. of  2 variables:
 $ en: chr  "Jacobus H. van 't Hoff" "Sully Prudhomme" "Henry Dunant" "Frédéric Passy" ...
 $ no: chr  NA NA NA NA ...
prizes_simple$knownName <- prizes_tidy$knownName$en
head(prizes_simple[, c("awardYear", "category", "knownName")], 5)
# A tibble: 5 × 3
  awardYear category   knownName             
  <chr>     <chr>      <chr>                 
1 1901      Chemistry  Jacobus H. van 't Hoff
2 1901      Literature Sully Prudhomme       
3 1901      Peace      Henry Dunant          
4 1901      Peace      Frédéric Passy        
5 1901      Physics    Wilhelm Conrad Röntgen
category_counts <- prizes_simple %>%
  count(category, sort = TRUE)

category_counts
# A tibble: 6 × 2
  category                   n
  <chr>                  <int>
1 Physiology or Medicine   232
2 Physics                  230
3 Chemistry                200
4 Peace                    143
5 Literature               122
6 Economic Sciences         99

Question 2: How has the number of Nobel laureates changed over time?

year_counts <- prizes_simple %>%
  count(awardYear)

head(year_counts)
# A tibble: 6 × 2
  awardYear     n
  <chr>     <int>
1 1901          6
2 1902          7
3 1903          7
4 1904          6
5 1905          5
6 1906          6
ggplot(year_counts, aes(x = as.numeric(awardYear), y = n)) +
  geom_line() +
  labs(
    title = "Number of Nobel Laureates Over Time",
    x = "Year",
    y = "Number of Laureates"
  )
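
Yearly counts move up and down a lot, so the trend may be easier to read by decade. The sketch below is one possible way to do that. It re-enters the first six rows of year_counts shown above so that it runs on its own; in the report it would be applied to the full year_counts table. The column names decade and laureates are introduced here for illustration.

```r
library(dplyr)

# First six rows of year_counts, copied from the output above.
year_counts_head <- data.frame(
  awardYear = c("1901", "1902", "1903", "1904", "1905", "1906"),
  n = c(6, 7, 7, 6, 5, 6)
)

decade_counts <- year_counts_head %>%
  # Round each year down to the start of its decade (1901 -> 1900).
  mutate(decade = floor(as.numeric(awardYear) / 10) * 10) %>%
  group_by(decade) %>%
  summarise(laureates = sum(n))

decade_counts
```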

Question 3: How do Nobel Prize categories compare before and after 2000?

prizes_time_period <- prizes_simple %>%
  mutate(
    period = ifelse(as.numeric(awardYear) < 2000, "Before 2000", "2000 and After")
  )

category_period_counts <- prizes_time_period %>%
  count(period, category)

category_period_counts
# A tibble: 12 × 3
   period         category                   n
   <chr>          <chr>                  <int>
 1 2000 and After Chemistry                 68
 2 2000 and After Economic Sciences         55
 3 2000 and After Literature                26
 4 2000 and After Peace                     37
 5 2000 and After Physics                   71
 6 2000 and After Physiology or Medicine    63
 7 Before 2000    Chemistry                132
 8 Before 2000    Economic Sciences         44
 9 Before 2000    Literature                96
10 Before 2000    Peace                    106
11 Before 2000    Physics                  159
12 Before 2000    Physiology or Medicine   169
ggplot(category_period_counts, aes(x = category, y = n, fill = period)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Nobel Laureates by Category: Before vs After 2000",
    x = "Category",
    y = "Count"
  )
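
Because the two periods cover very different numbers of years, raw counts favor the earlier period. One possible way to make the comparison fairer is to look at each category's share of laureates within its own period. The sketch below re-enters the counts from the table above so that it runs on its own; in the report it would use category_period_counts directly. The column name share is introduced here for illustration.

```r
library(dplyr)

categories <- c("Chemistry", "Economic Sciences", "Literature",
                "Peace", "Physics", "Physiology or Medicine")

# Counts copied from the category_period_counts output above.
counts <- data.frame(
  period   = rep(c("2000 and After", "Before 2000"), each = 6),
  category = rep(categories, times = 2),
  n        = c(68, 55, 26, 37, 71, 63, 132, 44, 96, 106, 159, 169)
)

shares <- counts %>%
  group_by(period) %>%
  mutate(share = n / sum(n)) %>%  # proportion of that period's laureates
  ungroup()

shares
```

Plotting share instead of n with the same dodged bar chart would show which categories gained or lost ground, independent of how long each period is.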

Question 4: Which laureates have won more than one Nobel Prize?

repeat_winners <- prizes_simple %>%
  filter(!is.na(knownName)) %>%
  count(knownName, sort = TRUE) %>%
  filter(n > 1)

repeat_winners
# A tibble: 5 × 2
  knownName              n
  <chr>              <int>
1 Frederick Sanger       2
2 John Bardeen           2
3 K. Barry Sharpless     2
4 Linus Pauling          2
5 Marie Curie            2
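
One caveat: counting by knownName drops every row where knownName is missing, which may exclude organizations that are recorded under orgName instead, and it would also merge two different laureates who happen to share a name. A hedged alternative is to count by id and attach a display name afterwards. The sketch below uses made-up rows with the same column names as prizes_simple; the ids and names are placeholders, not API values, and in the real data orgName may first need to be flattened the same way knownName was.

```r
library(dplyr)

# Made-up rows for illustration only; not Nobel API data.
toy <- data.frame(
  awardYear = c("1", "2", "3", "4", "5"),
  id        = c("p1", "p1", "o1", "o1", "p2"),
  knownName = c("Person One", "Person One", NA, NA, "Person Two"),
  orgName   = c(NA, NA, "Org One", "Org One", NA)
)

repeat_by_id <- toy %>%
  # Use the person's name when present, otherwise the organization's name.
  mutate(name = coalesce(knownName, orgName)) %>%
  count(id, name, sort = TRUE) %>%
  filter(n > 1)

repeat_by_id
```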