Assignment 10B: More JSON Practice
Approach
Data Source
I will use the Nobel Prize public API, which is available at the Nobel Prize Developer Zone and requires no authentication. The API returns data in JSON format and covers every prize awarded since 1901. I will work with two endpoints throughout this assignment.
The first endpoint is the Nobel Prizes endpoint, which returns one record per prize per year. Each record includes the award year, the prize category, and a list of laureates who received that prize. The second endpoint is the Laureates endpoint, which returns detailed records for every individual or organization that has ever won. Each laureate record includes biographical information such as birth country and gender, along with the prizes they received and the institution they were affiliated with at the time of the award.
I will retrieve both endpoints in R using the httr2 package, handle pagination so that I capture all records rather than just the first page, and then flatten the nested JSON structures into tidy data frames ready for analysis.
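As a rough sketch of the pagination step, the loop below keeps requesting pages until it has passed the total the API reports. The base URL, the `offset` and `limit` query parameters, and the `meta$count` field are my assumptions about the Nobel API and need to be checked against its documentation. `fetch_all_pages` and `nobel_fetcher` are names I made up for illustration.

```r
# Generic offset/limit pager. `fetch_page(offset, limit)` must return a list
# with `items` (a list of records) and `total` (the overall record count).
fetch_all_pages <- function(fetch_page, limit = 25) {
  records <- list()
  offset <- 0
  repeat {
    page <- fetch_page(offset, limit)
    records <- c(records, page$items)
    offset <- offset + limit
    # Stop on an empty page or once the reported total has been reached.
    if (length(page$items) == 0) break
    if (!is.null(page$total) && offset >= page$total) break
  }
  records
}

# Hypothetical adapter for the Nobel API. The URL, the query parameter names,
# and the `meta$count` field are assumptions, not confirmed by this document.
nobel_fetcher <- function(endpoint) {
  function(offset, limit) {
    req <- httr2::request("https://api.nobelprize.org/2.1")
    req <- httr2::req_url_path_append(req, endpoint)
    req <- httr2::req_url_query(req, offset = offset, limit = limit)
    body <- httr2::resp_body_json(httr2::req_perform(req))
    list(items = body[[endpoint]], total = body$meta$count)
  }
}

# Usage (needs network access and the httr2 package installed):
# laureates_raw <- fetch_all_pages(nobel_fetcher("laureates"), limit = 100)
# prizes_raw    <- fetch_all_pages(nobel_fetcher("nobelPrizes"), limit = 100)
```

Keeping the paging loop separate from the HTTP call means I can test the loop with a fake page function before touching the real API.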
Four Questions
Question 1: Which prize category has given out the most awards over time?
I want to know whether all six Nobel categories have been equally active historically or whether some have awarded prizes far more often than others. To answer this I will use the prize-level data, group it by category, and count the total number of prizes awarded in each one since 1901. I expect Physiology or Medicine, Physics, and Chemistry to lead because they have been awarded since the beginning, while Economic Sciences will be at the bottom since it was first awarded in 1969.
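The group-and-count step could look roughly like this in base R. The `prizes` data frame and its column names are toy placeholders I invented, not the real API fields.

```r
# Toy prize-level data: one row per prize per year (values are illustrative).
prizes <- data.frame(
  award_year = c(1901, 1901, 1902, 1969, 1903, 1902),
  category   = c("Physics", "Chemistry", "Physics",
                 "Economic Sciences", "Physics", "Chemistry"),
  stringsAsFactors = FALSE
)

# Count prizes per category and sort from most to fewest.
count_by_category <- function(prizes) {
  counts <- as.data.frame(table(category = prizes$category),
                          responseName = "n_prizes",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$n_prizes, counts$category), ]
  rownames(counts) <- NULL
  counts
}

counts <- count_by_category(prizes)
print(counts)
```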
Question 2: Which countries have produced the most Nobel laureates by birth?
This question focuses on origin rather than residence. Looking at birth country rather than citizenship at the time of the award gives a clearer picture of where Nobel level talent originates. I will filter out organizations since they do not have birth countries, deduplicate so each person is counted once, and then count individual laureates by birth country. I will display the top fifteen countries as a horizontal bar chart. This question also sets up the comparison in Question 3, because a country ranking high here does not necessarily mean it kept that talent.
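A minimal sketch of the filter, deduplicate, and count steps, assuming a flattened table with one row per laureate per prize. The column names `laureate_id`, `birth_country`, and `is_org` are hypothetical, and the rows are toy data.

```r
# Toy laureate-level data; id 2 appears twice because of two prizes.
laureates <- data.frame(
  laureate_id   = c(1, 2, 2, 3, 4, 5),
  birth_country = c("Germany", "Poland", "Poland", "USA", NA, "Germany"),
  is_org        = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

top_birth_countries <- function(laureates, n = 15) {
  # Keep individuals with a known birth country.
  people <- laureates[!laureates$is_org & !is.na(laureates$birth_country), ]
  # Count each person once, even if they won more than one prize.
  people <- people[!duplicated(people$laureate_id), ]
  counts <- as.data.frame(table(birth_country = people$birth_country),
                          responseName = "n_laureates",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$n_laureates, counts$birth_country), ]
  rownames(counts) <- NULL
  head(counts, n)
}

top <- top_birth_countries(laureates)
print(top)

# Horizontal bar chart, reversed so the largest bar is drawn at the top:
# barplot(rev(setNames(top$n_laureates, top$birth_country)), horiz = TRUE, las = 1)
```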
Question 3: Which country lost the most Nobel level talent to other countries?
This is the most complex of the four questions because it requires comparing two fields that sit at different levels of the data. My plan is to take each individual laureate and compare their birth country against the country where their affiliated institution was located when they received the prize. If those two countries differ, I will count the birth country as having lost that person and the affiliation country as having gained them. I will then calculate a net figure for each country by subtracting the number of laureates affiliated there from the number born there. A large positive number means that country produced talent that went on to win elsewhere, and a negative number means it attracted more laureates than it produced. I expect Germany and several Eastern European countries to show the largest gaps, largely because of emigration waves in the twentieth century, including Jewish scientists fleeing Nazi Germany in the 1930s.
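The net figure can be sketched as follows. I am assuming one row per laureate with a single affiliation country and that rows with no recorded affiliation are dropped; both are simplifications I would need to revisit with the real data. Because laureates who stayed count once as born and once as affiliated in the same country, born minus affiliated equals lost minus gained, which is what the code computes. The column names and rows are toy placeholders.

```r
# Toy data: birth country vs. country of the affiliated institution.
moves <- data.frame(
  laureate_id   = 1:6,
  birth_country = c("Germany", "Germany", "Hungary", "USA", "USA", "Poland"),
  affil_country = c("USA", "Germany", "USA", "USA", "Sweden", NA),
  stringsAsFactors = FALSE
)

net_talent_flow <- function(df) {
  # Assumption: laureates with no recorded affiliation cannot be classified.
  df <- df[!is.na(df$birth_country) & !is.na(df$affil_country), ]
  moved <- df[df$birth_country != df$affil_country, ]
  countries <- sort(unique(c(df$birth_country, df$affil_country)))
  # Tabulate against a shared set of levels so zero counts are kept.
  lost   <- table(factor(moved$birth_country, levels = countries))
  gained <- table(factor(moved$affil_country, levels = countries))
  out <- data.frame(country = countries,
                    lost    = as.integer(lost),
                    gained  = as.integer(gained),
                    stringsAsFactors = FALSE)
  out$net_loss <- out$lost - out$gained  # positive = net exporter of talent
  out <- out[order(-out$net_loss, out$country), ]
  rownames(out) <- NULL
  out
}

flow <- net_talent_flow(moves)
print(flow)
```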
Question 4: How does female representation compare across prize categories?
Women are underrepresented among Nobel laureates overall, but I want to know whether that gap is consistent across all six categories or whether certain fields look noticeably better or worse than others. I will filter to individual people, group by category and gender, and calculate the percentage of female laureates in each category. I will present this as a stacked percentage bar chart so it is easy to compare all categories side by side. I expect Literature and Peace to have the highest female representation and Physics to have the lowest, but I want to see the actual proportions rather than assume.
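One way to get the percentage per category is a two-way table with row proportions. This is only a sketch on toy rows; the `category` and `gender` column names and the "female"/"male" coding are assumptions about how the flattened data will look.

```r
# Toy data: one row per individual laureate per prize.
people <- data.frame(
  category = c("Physics", "Physics", "Physics", "Physics",
               "Peace", "Peace", "Literature", "Literature"),
  gender   = c("male", "male", "male", "female",
               "female", "male", "female", "female"),
  stringsAsFactors = FALSE
)

female_share <- function(people) {
  # Fix the gender levels so a category with no women still gets a 0 column.
  gender <- factor(people$gender, levels = c("female", "male"))
  tab <- table(people$category, gender)
  pct <- prop.table(tab, margin = 1) * 100  # row percentages
  out <- data.frame(category   = rownames(pct),
                    pct_female = round(as.numeric(pct[, "female"]), 1),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$pct_female, out$category), ]
  rownames(out) <- NULL
  out
}

share <- female_share(people)
print(share)

# Stacked percentage bars (one bar per category) could then be drawn with:
# barplot(t(prop.table(table(people$category, people$gender), 1)) * 100,
#         legend.text = TRUE)
```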
Analysis Plan
For each question I will follow the same general sequence. I will start by identifying the relevant fields from the tidy data frames, filter or join as needed, summarize the data down to the level the question requires, and then present the result as either a table or a visualization depending on which communicates the answer more clearly. For Question 3 specifically, the matching step is essential because the answer cannot be read from any single field. It requires pairing the birth country stored at the laureate level with the affiliation country stored in the prize records nested inside each laureate record in the Laureates endpoint, and then aggregating the difference across all laureates for each country.
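To pair the laureate-level and prize-level fields, the nested records first have to be flattened to one row per laureate per prize. The sketch below does that for toy records. The nesting I assume here (`birth$place$country$en`, `nobelPrizes`, `affiliations`, `orgName`) is my guess at the response shape and needs to be confirmed against the real JSON. `pluck_chr` and `flatten_laureates` are helper names I made up, and only the first affiliation per prize is kept.

```r
# Safely walk a nested list by name; returns NA if any level is missing.
pluck_chr <- function(x, ...) {
  for (key in c(...)) {
    if (!is.list(x)) return(NA_character_)
    x <- x[[key]]
  }
  if (is.null(x) || is.list(x)) NA_character_ else as.character(x)[1]
}

# One output row per laureate per prize.
flatten_laureates <- function(records) {
  rows <- list()
  for (rec in records) {
    for (prize in rec$nobelPrizes) {
      # Simplification: keep only the first listed affiliation, if any.
      affil <- if (length(prize$affiliations) > 0) prize$affiliations[[1]] else NULL
      rows[[length(rows) + 1]] <- data.frame(
        laureate_id   = pluck_chr(rec, "id"),
        is_org        = !is.null(rec$orgName),
        gender        = pluck_chr(rec, "gender"),
        birth_country = pluck_chr(rec, "birth", "place", "country", "en"),
        award_year    = as.integer(pluck_chr(prize, "awardYear")),
        category      = pluck_chr(prize, "category", "en"),
        affil_country = pluck_chr(affil, "country", "en"),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Toy records with placeholder values (not real laureates).
laureates_raw <- list(
  list(id = "L1", gender = "female",
       birth = list(place = list(country = list(en = "Country A"))),
       nobelPrizes = list(
         list(awardYear = "1950", category = list(en = "Physics"),
              affiliations = list(list(country = list(en = "Country B")))),
         list(awardYear = "1960", category = list(en = "Chemistry"),
              affiliations = list(list(country = list(en = "Country A")))))),
  list(id = "L2", orgName = list(en = "Example Org"),
       nobelPrizes = list(
         list(awardYear = "1970", category = list(en = "Peace"))))
)

flat <- flatten_laureates(laureates_raw)
print(flat)
```

With the data in this shape, the birth and affiliation countries sit side by side on each row, so the comparison for Question 3 becomes a simple column test.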