library(RCurl)
library(jsonlite)
library(purrr)
library(httr2)
library(tidyverse)
library(lubridate)
The data we’ll be using is about the Nobel Prize Laureates. This
information can be accessed virtually through the public API: https://app.swaggerhub.com/apis/NobelMedia/NobelMasterData/2.1
Furthermore, the actual database is found on http://api.nobelprize.org/2.1/laureates?limit=SETLIMIT
Where the limit of how many individuals you want to observe can be
set manually. This site will be the one used for the purposes of this
markdown. All of these sites and endpoints are sourced from https://www.nobelprize.org/about/developer-zone-2/
# Create url request
req <- request("http://api.nobelprize.org/2.1/laureates?limit=1050")
resp <- req_perform(req)
# Extract json
laureates_json <- resp %>%
resp_body_json(flatten = TRUE)
The initial request has the limit set to 1050, it was an arbitrary number but does seem to collect all the laureates.
First tidy the JSON into a dataframe. Note that laureates with multiple prizes did not have their prizes properly represented in the dataframe.
laureates <- laureates_json$laureates %>%
map_dfr(function(laureate) {
tibble(
name = if (!is.null(laureate$fullName$en)){
laureate$fullName$en
} else{
laureate$orgName$en
},
country = if (!is.null(laureate$birth)) {
laureate$birth$place$country$en
} else{
laureate$founded$place$country$en
},
birthday = if (!is.null(laureate$birth)){
laureate$birth$date
} else{
laureate$founded$date
},
gender = laureate$gender,
year.awarded = laureate$nobelPrizes[[1]]$awardYear,
category = laureate$nobelPrizes[[1]]$category$en,
prize.amount.adjusted = laureate$nobelPrizes[[1]]$prizeAmountAdjusted,
affiliation = laureate$nobelPrizes[[1]]$affiliations[[1]]$name$en,
affiliation.country = laureate$nobelPrizes[[1]]$affiliations[[1]]$country$en
)
})
The final result of this importing and tidying is the dataframe laureates which contains the information: name, home country, birthday (or founding date), gender, category (of prize), prize amount, and affiliations, and affiliated country. Some of these values don’t have data for natural reasons - such as if a laureate was an organization it won’t have a gender. But, some other missing values are incomplete from the source.
1. Which country retained the most nobel laureates?
First, create new column to track if home country equals affiliated country, NA’s in either column will result in NA.
# first ifelse is to filter NA's, the second is to compare the columns
laureates$country.retained <- ifelse(
!is.na(laureates$country) &
!is.na(laureates$affiliation.country),
ifelse(
laureates$country == laureates$affiliation.country,
"yes",
"no"
),
NA
)
Then create summary
laureates %>%
filter(!is.na(country)) %>%
group_by(country) %>%
summarise(
laureates.born = n(),
retained = sum(country.retained == "yes", na.rm = TRUE),
ratio = retained / laureates.born
)
## # A tibble: 101 × 4
## country laureates.born retained ratio
## <chr> <int> <int> <dbl>
## 1 Argentina 4 1 0.25
## 2 Australia 11 3 0.273
## 3 Austria 17 5 0.294
## 4 Austria-Hungary 13 0 0
## 5 Austrian Empire 4 0 0
## 6 Bangladesh 1 0 0
## 7 Bavaria 1 0 0
## 8 Belgian Congo 1 0 0
## 9 Belgium 10 4 0.4
## 10 Bosnia 1 0 0
## # ℹ 91 more rows
This shows that the USA has the highest retention rate for laureates, with 261 of their 302 naturally born laureates winning the nobel prize while still affiliated with the USA.
2. The second question to be answered will be the one provided
in the assignment description.
Which country “lost” the most nobel laureates (who were born there
but received their Nobel prize as a citizen of a different country)?
The code looks similar, except the ratio column will be replaced with the difference between laureates born and retained. This will show a better indicator of how many laureates a country lost as compared to ratios.
laureates %>%
filter(!is.na(country)) %>%
group_by(country) %>%
summarise(
laureates.born = n(),
retained = sum(country.retained == "yes", na.rm = TRUE),
difference = laureates.born - retained
)
## # A tibble: 101 × 4
## country laureates.born retained difference
## <chr> <int> <int> <int>
## 1 Argentina 4 1 3
## 2 Australia 11 3 8
## 3 Austria 17 5 12
## 4 Austria-Hungary 13 0 13
## 5 Austrian Empire 4 0 4
## 6 Bangladesh 1 0 1
## 7 Bavaria 1 0 1
## 8 Belgian Congo 1 0 1
## 9 Belgium 10 4 6
## 10 Bosnia 1 0 1
## # ℹ 91 more rows
Surprisingly, the USA wins this again with having lost 41 laureates. Though Germany and the UK are tied for second place with 37 laureates lost each. Though, between the two the UK has the higher retention rate.
3. On average, which category of prize of awarded at the highest
age?
First create column that indicates age of laureate at time of award.
laureates$age.awarded <- as.numeric(laureates$year.awarded) -
as.numeric(substr(laureates$birthday, 1, 4))
Note that the ages of organizations were ignored, as the question intends to focus on individual laureates. This is achieved by filtering out cases where the gender is NA.
laureates %>%
filter(!is.na(gender)) %>%
group_by(category) %>%
summarise(
average.age = mean(age.awarded, na.rm = TRUE),
youngest = min(age.awarded, na.rm = TRUE),
oldest = max(age.awarded, na.rm = TRUE),
total = n()
)
## # A tibble: 6 × 5
## category average.age youngest oldest total
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Chemistry 59.2 35 97 197
## 2 Economic Sciences 67.0 47 90 99
## 3 Literature 65.0 42 88 122
## 4 Peace 60.8 17 87 111
## 5 Physics 57.7 25 96 229
## 6 Physiology or Medicine 58.9 32 87 232
From this we can see that on average, Economics Science laureates tend to be older. This raises another interesting question as to why this trend is the case. Considering also Economic Sciences has the least number of awarded laureates.
laureates %>%
filter(!is.na(gender)) %>%
group_by(gender) %>%
summarise(
average.amount = mean(prize.amount.adjusted, na.rm = TRUE),
smallest = min(prize.amount.adjusted, na.rm = TRUE),
largest = max(prize.amount.adjusted, na.rm = TRUE),
total = n()
)
## # A tibble: 2 × 5
## gender average.amount smallest largest total
## <chr> <dbl> <int> <int> <int>
## 1 female 9892634. 3161325 14930730 67
## 2 male 7904780. 3006134 15547541 923
We can see that on average, female laureates are awarded higher amounts, but there is a significant difference between the total number of male and female laureates.