Nobel Prize API - Project Approach
Overview
This project uses the Nobel Prize public API to pull structured JSON data and analyze it in R. The goal is to answer five data-driven questions about Nobel Prize trends, using the laureate and prize endpoints from the Nobel Prize Developer Zone.
Data Source
The Nobel Prize organization provides two public APIs, one focused on prizes and one on laureates. Both return JSON. We will pull from both, convert them into tidy data frames in R, and join them as needed. No API key is required.
The Five Questions
I thought of five interesting questions, so I’m adding an additional one. Hope you don’t mind!
Q1 — Which institutions have produced the most Nobel laureates?
We will parse the affiliation field from the laureate data, clean institution names, and rank them by count. The main challenge is inconsistent naming: the same university may appear under several different spellings.
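The clean-and-rank step could look roughly like the sketch below, which uses dplyr on made-up rows. The column names, the institution names, and the aliases lookup are all illustrative assumptions. A real alias table would have to be built by hand after inspecting the distinct affiliation strings.

```r
library(dplyr)

# Made-up rows (one per laureate-affiliation pair); not real API output.
affiliations <- tibble(
  laureate_id = c(1, 2, 3, 3, 4, 5),
  affiliation = c("University of Example", "Univ. of Example",
                  "university of  example ", "University of Example",
                  "Sample Institute", NA)
)

# Hypothetical hand-built lookup: variant spelling -> canonical spelling.
aliases <- c("univ. of example" = "university of example")

clean_institution <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
  ifelse(x %in% names(aliases), unname(aliases[x]), x)
}

institution_ranking <- affiliations %>%
  filter(!is.na(affiliation)) %>%                 # drop laureates with no affiliation
  mutate(institution = clean_institution(affiliation)) %>%
  distinct(laureate_id, institution) %>%          # count each laureate once per institution
  count(institution, sort = TRUE, name = "laureates")

print(institution_ranking)
```

Counting distinct laureate-institution pairs, rather than raw rows, keeps a laureate who is listed twice under the same institution from being double counted.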
Q2 — Which countries win the most prizes relative to their population?
We will match laureate birth or citizenship country to external population data, then compute a prizes-per-million ratio. This requires a join with an outside dataset, likely from the World Bank.
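A minimal sketch of the per-capita calculation, assuming one row per prize with a country column and a separate population table. The country names and population figures below are placeholders, not World Bank values.

```r
library(dplyr)

# Placeholder data: one row per prize, plus an invented population table.
prizes <- tibble(
  country = c("Country A", "Country A",
              "Country B", "Country B", "Country B",
              "Country C")
)
population <- tibble(
  country    = c("Country A", "Country B"),
  population = c(2e6, 30e6)
)

per_capita <- prizes %>%
  count(country, name = "prizes") %>%
  left_join(population, by = "country") %>%       # keeps countries with no population match
  mutate(prizes_per_million = prizes / (population / 1e6)) %>%
  arrange(desc(prizes_per_million))               # unmatched countries (NA) sort last

print(per_capita)
```

Using a left join rather than an inner join keeps unmatched countries visible with a missing ratio, so naming mismatches show up instead of silently dropping rows.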
Q3 — How has the gender breakdown changed over time and across categories?
We will group laureates by decade and prize category using the gender field, then visualize the trend. This is mostly a grouping and plotting task.
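The grouping step might be sketched as follows, assuming a flattened table with year, category, and gender columns. The rows are invented for illustration, and the column names are assumptions about the cleaned data rather than the API's own field names.

```r
library(dplyr)

# Invented rows; the last one mimics an organization with no gender recorded.
laureates <- tibble(
  year     = c(1901, 1905, 1909, 2001, 2004, 2008, 2009),
  category = c("physics", "physics", "physics",
               "physics", "physics", "physics", "peace"),
  gender   = c("male", "male", "female", "male", "female", "female", NA)
)

gender_by_decade <- laureates %>%
  filter(gender %in% c("female", "male")) %>%     # drops rows with missing gender
  mutate(decade = floor(year / 10) * 10) %>%
  count(decade, category, gender, name = "n") %>%
  group_by(decade, category) %>%
  mutate(share = n / sum(n)) %>%                  # share within each decade-category cell
  ungroup()

print(gender_by_decade)
```

The resulting table has one row per decade, category, and gender, which is the shape a stacked bar or line plot in ggplot2 would expect.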
Q4 — Which countries lost the most laureates to emigration?
We will compare the born country field to the country the laureate represented when awarded. Where those differ, we count that as a “loss” for the birth country. This is the most complex question since it requires filtering and comparing across two fields.
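One way to sketch the comparison is shown below. It assumes the country at the time of the award is taken from the laureate's affiliation country, which is a stand-in chosen for this sketch and may not match the final definition. All rows and column names are invented.

```r
library(dplyr)

# Invented rows; NA marks a field the data does not fill in.
laureates <- tibble(
  laureate_id   = 1:6,
  born_country  = c("Country A", "Country A", "Country A",
                    "Country B", "Country B", NA),
  award_country = c("Country B", "Country A", "Country C",
                    "Country B", NA, "Country C")
)

# Records that cannot be compared because one of the two fields is missing.
dropped <- sum(is.na(laureates$born_country) | is.na(laureates$award_country))

losses <- laureates %>%
  filter(!is.na(born_country), !is.na(award_country)) %>%
  filter(born_country != award_country) %>%       # born in one country, awarded in another
  count(born_country, name = "laureates_lost", sort = TRUE)

print(losses)
print(dropped)
```

Reporting the number of dropped records alongside the ranking makes it clear how much of the data the result actually rests on.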
Q5 — Which countries dominate specific prize categories?
We will cross-tabulate country and prize category, then look at share within each category. A heatmap will likely be the best way to show this.
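A rough sketch of the cross-tabulation and heatmap, using dplyr and ggplot2 on invented rows. The column names are assumptions about the cleaned data, and the plot is only built here, not styled.

```r
library(dplyr)
library(ggplot2)

# Invented rows: one per prize.
prizes <- tibble(
  country  = c("Country A", "Country A", "Country B",
               "Country A", "Country B", "Country B"),
  category = c("physics", "physics", "physics",
               "literature", "literature", "literature")
)

category_shares <- prizes %>%
  count(category, country, name = "prizes") %>%
  group_by(category) %>%
  mutate(share = prizes / sum(prizes)) %>%        # shares sum to 1 within each category
  ungroup()

# One tile per country-category pair, shaded by share.
share_plot <- ggplot(category_shares,
                     aes(x = category, y = country, fill = share)) +
  geom_tile()

print(category_shares)
```

Normalizing within each category, rather than across the whole table, keeps categories with fewer prizes from looking washed out in the heatmap.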
General Approach in R
- Call the API using httr or jsonlite
- Flatten the nested JSON into data frames
- Clean and standardize key fields like country names and institution names
- Join datasets where needed
- Answer each question with a summary table or plot using ggplot2 and dplyr
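The first two steps above can be sketched with jsonlite as follows. The endpoint URL and the field names (bornCountry, prizes, affiliations) are assumptions based on version 1 of the API and should be checked against the Developer Zone documentation. The helpers fetch_laureates, null_to_na, and flatten_laureates are hypothetical names introduced for this sketch, and the sample JSON is invented so the example runs offline.

```r
library(jsonlite)

# Assumed endpoint; defined but not called here so the sketch runs offline.
fetch_laureates <- function(url = "https://api.nobelprize.org/v1/laureate.json") {
  fromJSON(url, simplifyVector = FALSE)
}

# Invented sample shaped like the assumed response.
sample_json <- '{
  "laureates": [
    {"id": "1", "gender": "female", "bornCountry": "Country A",
     "prizes": [
       {"year": "1950", "category": "physics",
        "affiliations": [{"name": "Institute X", "country": "Country B"}]},
       {"year": "1960", "category": "chemistry",
        "affiliations": [{"name": "Institute X", "country": "Country B"},
                         {"name": "Institute Y", "country": "Country C"}]}
     ]},
    {"id": "2",
     "prizes": [{"year": "1970", "category": "peace", "affiliations": []}]}
  ]
}'

null_to_na <- function(x) if (is.null(x)) NA_character_ else x

# One output row per laureate, prize, and affiliation.
flatten_laureates <- function(parsed) {
  rows <- list()
  for (l in parsed$laureates) {
    for (p in l$prizes) {
      affs <- p$affiliations
      if (length(affs) == 0) affs <- list(list())  # keep prizes with no affiliation
      for (a in affs) {
        rows[[length(rows) + 1]] <- data.frame(
          id                  = null_to_na(l$id),
          gender              = null_to_na(l$gender),
          born_country        = null_to_na(l$bornCountry),
          year                = as.integer(null_to_na(p$year)),
          category            = null_to_na(p$category),
          affiliation         = null_to_na(a$name),
          affiliation_country = null_to_na(a$country),
          stringsAsFactors    = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

flat <- flatten_laureates(fromJSON(sample_json, simplifyVector = FALSE))
print(flat)
```

Parsing with simplifyVector = FALSE keeps the response as plain nested lists, which makes missing fields and empty arrays easy to handle explicitly, at the cost of a slower loop than a vectorized approach.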
Limitations
Country name inconsistencies. Country names in the data reflect the political borders of the period rather than today's map. Someone born in what is now Poland may be listed under a country that no longer exists. This makes the migration question harder to interpret cleanly.
Missing data. Not every laureate has complete affiliation, gender, or country information. Organizations that win the Peace Prize, for example, have no birth country or gender.
Population data is a snapshot. For the per-capita question, we will use current population figures. This is not perfectly accurate for prizes awarded decades ago when populations were very different.
Affiliation field is inconsistent. Institutions are listed as free text, so the same school may appear under several different names. Some manual cleaning will be needed.
Gender data is binary. The API only records male or female. This is a limitation of the data itself, not our analysis.
Potential Hurdles
- Flattening deeply nested JSON in R can be tricky, especially when laureates have multiple prizes or affiliations listed as arrays.
- Joining Nobel country data to World Bank population data will require careful matching since country names will not always align perfectly.
- The migration question depends on two country fields that are not always both filled in, so we may lose some records.
- Not every prize category covers the same period. The Economics prize was first awarded in 1969, so comparisons across all categories need to account for its shorter history.
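For the country-matching hurdle, a quick diagnostic is to list the names that fail to join before and after applying a hand-built recode table. The sketch below uses dplyr's anti_join. The spellings are illustrative assumptions about how the two sources might differ, and recode_map is a hypothetical lookup that would need to be extended after inspecting the real data.

```r
library(dplyr)

# Illustrative spellings only; the real values need to be checked in both sources.
nobel <- tibble(country = c("USA", "Russia", "France", "Czechoslovakia"))
world_bank <- tibble(country = c("United States", "Russian Federation", "France"))

# Hypothetical hand-built lookup: Nobel spelling -> World Bank spelling.
recode_map <- c("USA" = "United States", "Russia" = "Russian Federation")

unmatched_before <- anti_join(nobel, world_bank, by = "country")

nobel_fixed <- nobel %>%
  mutate(country = ifelse(country %in% names(recode_map),
                          unname(recode_map[country]),
                          country))

# Whatever is left needs a manual decision, e.g. countries that no longer exist.
unmatched_after <- anti_join(nobel_fixed, world_bank, by = "country")

print(unmatched_before)
print(unmatched_after)
```

Running the anti-join after each round of recoding gives a shrinking to-do list, and the names that remain are the ones to document as excluded or to map by hand.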