For this assignment, I will use the Nobel Prize public API to retrieve JSON data and transform it into tidy data frames in R. The Nobel Prize Developer Zone provides open data through a REST API, and the current API version is 2.1. The API returns Nobel Prize and laureate data in JSON or CSV format, which makes it appropriate for this assignment’s focus on JSON practice.
My first step will be to identify which Nobel API endpoint or endpoints are most useful for the questions I want to answer. Since the assignment requires four data-driven questions, I will likely work with prize-level data and laureate-level data so I can examine both award information and person-level details. After retrieving the JSON responses in R, I will parse the nested data and convert it into tidy tables. This will likely involve separating prize records from laureate records, expanding nested fields, and selecting the variables needed for analysis. Because JSON data often contains nested structures, one important part of the assignment will be reshaping the data into rectangular data frames that can be filtered, grouped, joined, and visualized.
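The parsing and reshaping step described above can be sketched on a small hand-written sample. This is a minimal sketch, not the actual retrieval code: the JSON below only imitates the shape I expect from a prize-level response (a `nobelPrizes` array with a nested `category` object and a `laureates` array), and the `id` values are placeholders.

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# Hand-written sample imitating a prize-level response (assumed shape;
# the real payload has many more fields, and the ids here are placeholders).
sample_json <- '{
  "nobelPrizes": [
    {"awardYear": "1901", "category": {"en": "Literature"},
     "laureates": [{"id": "p1", "knownName": {"en": "Sully Prudhomme"}}]},
    {"awardYear": "1903", "category": {"en": "Physics"},
     "laureates": [{"id": "p2", "knownName": {"en": "Henri Becquerel"}},
                   {"id": "p3", "knownName": {"en": "Pierre Curie"}},
                   {"id": "p4", "knownName": {"en": "Marie Curie"}}]}
  ]
}'

# fromJSON() simplifies the array of prize records into a data frame;
# category becomes a nested data frame column and laureates a list column.
prizes_raw <- fromJSON(sample_json)$nobelPrizes

prizes_tidy <- prizes_raw %>%
  as_tibble() %>%
  mutate(category = category$en) %>%   # nested data frame column -> character
  unnest(laureates) %>%                # one row per laureate
  mutate(laureate = knownName$en) %>%  # pull the English name out of knownName
  select(awardYear, category, id, laureate)

prizes_tidy
```

The result has one row per laureate rather than one row per prize, which is the shape the counting and grouping steps later in the report rely on.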
Once the data is loaded into tidy format, I will formulate four questions that can be answered directly from the Nobel data. At least one of these questions will go beyond a simple count and require a more advanced transformation, such as joining laureate information to prize information, comparing birth-country fields to award-affiliation or citizenship-related fields, or examining changes across time and categories. This matches the assignment requirement that at least one question involve joining, filtering, or comparing multiple fields rather than only summarizing totals.
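As an illustration of the kind of join and field comparison mentioned above, here is a minimal sketch on two toy tables. The column names (`affiliation_country`, `birth_country`) and all values are hypothetical stand-ins, not the API's actual field names or data.

```r
library(dplyr)

# Hypothetical prize-level rows: one row per laureate award
prize_rows <- tibble(
  id = c("a1", "a2", "a3"),
  awardYear = c("1950", "1950", "1975"),
  affiliation_country = c("Country X", "Country Y", "Country X")
)

# Hypothetical laureate-level rows keyed by the same id
laureate_rows <- tibble(
  id = c("a1", "a2", "a3"),
  birth_country = c("Country X", "Country X", "Country Z")
)

# Join person-level details onto award rows, then compare the two fields
joined <- prize_rows %>%
  left_join(laureate_rows, by = "id") %>%
  mutate(same_country = birth_country == affiliation_country)

joined %>% count(same_country)
```

Using `left_join()` keeps every award row even if a laureate record is missing, so gaps show up as `NA` instead of silently dropping rows.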
A good strategy will be to choose four questions with increasing complexity. I will begin with one or two straightforward descriptive questions, such as which prize category has the most laureates or which years had the highest number of awardees. Then I will include more analytical questions, such as comparing the geographic origins of laureates with the countries connected to their prize affiliations, identifying trends by decade, or examining how the number of shared prizes has changed over time. This approach will show both basic JSON handling and stronger data-wrangling skills.
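A trend-by-decade question, for example, could be approached as follows. This is a sketch on a toy data frame; it only assumes that `awardYear` arrives as a character column, as it does in the output later in this report.

```r
library(dplyr)

# Toy data: one row per laureate, awardYear stored as character
toy_prizes <- tibble(awardYear = c("1901", "1905", "1912", "1919", "1919"))

decade_counts <- toy_prizes %>%
  # Integer division drops the last digit: 1919 -> 191 -> 1910
  mutate(decade = (as.integer(awardYear) %/% 10L) * 10L) %>%
  count(decade)

decade_counts
```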
For each question, I will clearly structure the report in four parts: the question itself, the code used to retrieve and process the relevant data, the resulting table or plot, and a short interpretation of the answer. This structure will make it easy to demonstrate that each conclusion comes directly from the JSON data and that the full workflow is reproducible in Quarto.
One anticipated challenge is that Nobel API data is nested and may include repeated subfields for laureates, prize motivations, affiliations, or locations. Because of this, I will need to inspect the structure carefully before tidying it. Another challenge is that some variables may be missing for certain records or may appear differently across organizations and individuals, so I will need to handle incomplete fields carefully. A third challenge is making sure that the questions are interesting enough to go beyond simple counts while still being clearly answerable from the available JSON data.
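One way to handle name fields that differ between individuals and organizations is to fall back from one column to another. The sketch below uses hypothetical column names (`person_name`, `org_name`) and hand-written rows; the real field names should be checked against the parsed response first.

```r
library(dplyr)

# Hypothetical rows: a person, an organization, and a record with no name
toy_laureates <- tibble(
  id = c("1", "2", "3"),
  person_name = c("Henry Dunant", NA, NA),
  org_name = c(NA, "Institute of International Law", NA)
)

named <- toy_laureates %>%
  # coalesce() takes the first non-missing value across its arguments
  mutate(display_name = coalesce(person_name, org_name, "(name missing)"))

named
```

Keeping an explicit placeholder such as "(name missing)" makes incomplete records visible in tables and plots instead of hiding them as `NA`.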
The final deliverable will be a single Quarto file containing all four questions, all R code used to retrieve and tidy the Nobel Prize JSON data, and the resulting answers in the form of tables, summaries, or plots. This file will demonstrate the full workflow from API retrieval to tidy analysis and interpretation.
Codebase
# Load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
# A tibble: 10 × 3
   awardYear category               knownName$en
   <chr>     <chr>                  <chr>
 1 1901      Chemistry              Jacobus H. van 't Hoff
 2 1901      Literature             Sully Prudhomme
 3 1901      Peace                  Henry Dunant
 4 1901      Peace                  Frédéric Passy
 5 1901      Physics                Wilhelm Conrad Röntgen
 6 1901      Physiology or Medicine Emil von Behring
 7 1902      Chemistry              Emil Fischer
 8 1902      Literature             Theodor Mommsen
 9 1902      Peace                  Élie Ducommun
10 1902      Peace                  Albert Gobat
# Check what other columns are available in prizes_simple
names(prizes_simple)
Question 1: Which Nobel Prize category has the most laureates?
# Count laureates by category (prizes_simple has one row per laureate)
category_counts <- prizes_simple %>%
  count(category, sort = TRUE)
category_counts
# A tibble: 5 × 2
  category                   n
  <chr>                  <int>
1 Physics                    8
2 Peace                      7
3 Literature                 6
4 Chemistry                  5
5 Physiology or Medicine     5
# Create a plot
ggplot(category_counts, aes(x = reorder(category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Number of Nobel Laureates by Category",
    x = "Category",
    y = "Count"
  )
Question 2: How have Nobel Prizes been awarded over time?
# How Nobel Prizes have been awarded over time
year_counts <- prizes_simple %>%
  count(awardYear)
head(year_counts)
ggplot(year_counts, aes(x = as.numeric(awardYear), y = n)) +
  geom_line() +
  labs(
    title = "Number of Nobel Laureates Over Time",
    x = "Year",
    y = "Number of Laureates"
  )
# Find the most recent years in the dataset
sort(unique(prizes_simple$awardYear), decreasing = TRUE)[1:10]
[1] "1905" "1904" "1903" "1902" "1901" NA NA NA NA NA
The first request returned only five award years (1901 to 1905), so I am rerunning the API request to pull more data.
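A plausible explanation is that the endpoint pages its results and the first call received only the default page. The helper below is a hypothetical sketch of how the request URL could be built with an explicit page size; it assumes the v2.1 endpoint accepts `limit` and `offset` query parameters, which should be confirmed against the API documentation. The network call itself is left commented out.

```r
# Hypothetical helper: build a Nobel Prize API v2.1 URL with paging parameters.
# Assumes the endpoint accepts `limit` and `offset` as query parameters.
nobel_url <- function(endpoint = "nobelPrizes", limit = 25, offset = 0) {
  paste0(
    "https://api.nobelprize.org/2.1/", endpoint,
    "?limit=", as.integer(limit),
    "&offset=", as.integer(offset)
  )
}

nobel_url(limit = 1000)

# Not run here: fetch and parse the larger page
# prizes_raw <- jsonlite::fromJSON(nobel_url(limit = 1000))$nobelPrizes
```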
# A tibble: 5 × 3
  awardYear category   knownName$en           $no
  <chr>     <chr>      <chr>                  <chr>
1 1901      Chemistry  Jacobus H. van 't Hoff <NA>
2 1901      Literature Sully Prudhomme        <NA>
3 1901      Peace      Henry Dunant           <NA>
4 1901      Peace      Frédéric Passy         <NA>
5 1901      Physics    Wilhelm Conrad Röntgen <NA>
class(prizes_simple$knownName)
[1] "data.frame"
str(prizes_simple$knownName)
'data.frame': 1026 obs. of 2 variables:
$ en: chr "Jacobus H. van 't Hoff" "Sully Prudhomme" "Henry Dunant" "Frédéric Passy" ...
$ no: chr NA NA NA NA ...
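Because `knownName` is itself a data frame stored inside a column, it can be spread into ordinary columns before further analysis. The sketch below rebuilds a two-row toy version of that structure and uses `tidyr::unpack()`; the toy rows are hand-written, and the `knownName_` prefix is just one naming choice.

```r
library(dplyr)
library(tidyr)

# Toy version of the structure shown above: a data frame column with en/no
toy_names <- tibble(
  awardYear = c("1901", "1901"),
  knownName = tibble(
    en = c("Sully Prudhomme", "Henry Dunant"),
    no = c(NA_character_, NA_character_)
  )
)

# unpack() turns the nested data frame column into regular columns
flat_names <- toy_names %>%
  unpack(knownName, names_sep = "_")

names(flat_names)
```

If only the English name is needed, `mutate(knownName = knownName$en)` is a shorter alternative that replaces the nested column with a plain character vector.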
ggplot(year_counts, aes(x = as.numeric(awardYear), y = n)) +
  geom_line() +
  labs(
    title = "Number of Nobel Laureates Over Time",
    x = "Year",
    y = "Number of Laureates"
  )
Question 3: How do Nobel Prize categories compare before and after 2000?
# A tibble: 12 × 3
   period         category                   n
   <chr>          <chr>                  <int>
 1 2000 and After Chemistry                 68
 2 2000 and After Economic Sciences         55
 3 2000 and After Literature                26
 4 2000 and After Peace                     37
 5 2000 and After Physics                   71
 6 2000 and After Physiology or Medicine    63
 7 Before 2000    Chemistry                132
 8 Before 2000    Economic Sciences         44
 9 Before 2000    Literature                96
10 Before 2000    Peace                    106
11 Before 2000    Physics                  159
12 Before 2000    Physiology or Medicine   169
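The code that produced this table is not shown in the rendered output. A table of this shape could come from a step like the one below, sketched here on toy rows; the period labels match the output above, but the cutoff logic (award year 2000 counts as "2000 and After") is my assumption.

```r
library(dplyr)

# Toy data: one row per laureate
toy_prizes <- tibble(
  awardYear = c("1995", "1999", "2000", "2010", "2010"),
  category = c("Physics", "Peace", "Physics", "Physics", "Peace")
)

toy_period_counts <- toy_prizes %>%
  mutate(
    # Assumption: the year 2000 itself belongs to the later period
    period = if_else(as.integer(awardYear) >= 2000, "2000 and After", "Before 2000")
  ) %>%
  count(period, category)

toy_period_counts
```

Since the earlier period spans about a century and the later one roughly a quarter of that, raw counts are not directly comparable; shares within each period would give a fairer comparison.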
ggplot(category_period_counts, aes(x = category, y = n, fill = period)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Nobel Laureates by Category: Before vs After 2000",
    x = "Category",
    y = "Count"
  )
Question 4: Which laureates have won more than one Nobel Prize?
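One way to approach this question is to group the laureate-level rows by a stable identifier and keep the groups with more than one award. The sketch below uses hand-written toy rows and assumes each row carries a laureate `id`; the id values here are placeholders, and the column names are illustrative.

```r
library(dplyr)

# Toy data: one row per laureate award (ids are placeholders)
toy_awards <- tibble(
  id = c("m1", "m1", "m2", "m3"),
  name = c("Marie Curie", "Marie Curie", "Pierre Curie", "Henri Becquerel"),
  awardYear = c("1903", "1911", "1903", "1903"),
  category = c("Physics", "Chemistry", "Physics", "Physics")
)

repeat_winners <- toy_awards %>%
  group_by(id, name) %>%
  summarise(
    n_prizes = n(),
    years = paste(awardYear, collapse = ", "),
    .groups = "drop"
  ) %>%
  filter(n_prizes > 1)

repeat_winners
```

Grouping on `id` rather than on the name alone guards against two different laureates who happen to share a name, and against one laureate whose name is spelled differently across records.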