This advance report provides an analysis of Nobel Prize Winners in their fields since the very first Nobel Prize in 1901 to 2016. This document illustrates some ways to analyse data from the Nobel Prize API using the R programming language. The analysis includes multiple prize winners, gender gap, gender by category — “chemistry, economics, literature, medicine or physiology, peace, and physics”; gender over time, nobel prize share, distribution by age, and distribution by countries etc. Some adavnce analysis like — clustering, classifications, finite mixture models and principal component analysis, also included to get more inference related to Nobel Prize and it’s Laureates.
Thankfully, the Nobel Foundation & Nobelprize.org - is a registered trademark, and is produced, managed and maintained by Nobel Media, had created exactly the database of information for every Nobel Prize since 1901, including the Nobel Laureate’s biographies, Nobel Lectures, interviews, photos, articles, video clips, and press releases. Nobelprize.org provides comprehensive, first-hand information about the Nobel Prize and Nobel Laureates in Physics, Chemistry, Physiology or Medicine, Literature and Peace starting in 1901, as well as the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel and the Economics Laureates starting in 1969.
I would like to thanks, the ‘Nobel Foundation’ — had created exactly this dataset till this year — “2016 Nobel Prize Announcements”. Why, I have come to Nobel Foundation, because there wasn’t any other place that had the data as nicely structured and informative.
Again, The Nobel Foundation helped out by supplying a dataset with all of the Nobel Laureates from 1901 to 2016 and every effort to give Nobel Prizes for the greatest benefit to mankind.
Now, We have to load some required packages, to work on!
# Remove Objects from a Specified Environment.
rm(list = ls())
if (!require("caret") | !require("cluster") | !require("plyr") | !require("ggplot2") | !require("ggfortify") | !require("jsonlite") | !require("mixtools") | !require("stringi") | !require("dplyr")) {
stop('Some required package(s) is not installed!')
} else {
library("caret") # 1. classification and regression Training.
library("cluster") # 2. "Finding Groups in Data": cluster analysis extended.
library("plyr") # 3. plyr is a Tools for splitting, applying and combining data.
library("ggplot2") # 4. ggplot2 is a plotting system for R, based on the grammar of graphics.
library("ggfortify") # 5. Data visualization tools for statistical analysis results.
library("jsonlite") # 6. jsonlite is a Robust, High Performance JSON Parser and Generator for R.
library("mixtools") # 7. mixtools is a tool for analyzing Finite Mixture Models.
library("stringi") # 8. string processing package for R.
library("dplyr") # 9. dplyr is a grammar of data manipulation.
# -----> All are needed to do the analysis.
}
# Loading required package: caret
# Loading required package: lattice
# Loading required package: ggplot2
# Loading required package: cluster
# Loading required package: plyr
# Loading required package: ggfortify
# Loading required package: jsonlite
# Loading required package: mixtools
# mixtools package, version 1.0.4, Released 2016-01-11
# This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
# Loading required package: stringi
# Loading required package: dplyr
#
# Attaching package: 'dplyr'
# The following objects are masked from 'package:plyr':
#
# arrange, count, desc, failwith, id, mutate, rename, summarise,
# summarize
# The following objects are masked from 'package:stats':
#
# filter, lag
# The following objects are masked from 'package:base':
#
# intersect, setdiff, setequal, union
options(scipen = 1) # Turn off scientific notations for numbers.
if (!dir.exists("./gold")) {
dir.create("./gold")
}
Nobelprize.org offers open data to developers in two ways: An API and as Linked Data. The data is free to use and contains information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award. The data is regularly updated as the information on Nobelprize.org is updated, including at the time of announcements of new Laureates. The REST based API described here provides different ways to list and search the data. The result is returned as JSON or CSV.
# -> We use the Nobel Prize API to fetch Nobel Laureates Data in JSON format.
theData <- "http://api.nobelprize.org/v1/laureate.json"
thePrize <- "http://api.nobelprize.org/v1/prize.json"
theCountry <- "http://api.nobelprize.org/v1/country.json"
get.data <- function() {
print("Hitting Nobel API...")
if (!"laureate.json" %in% dir("./gold/")) {
print("getting data for laureates, please wait...")
download.file(theData, destfile = "gold/laureate.json")
}
if (!"prize.json" %in% dir("./gold/")) {
print("getting data for prizes, please wait...")
download.file(thePrize, destfile = "gold/prize.json")
}
if (!"country.json" %in% dir("./gold/")) {
print("getting data for countries, please wait...")
download.file(theCountry, destfile = "gold/country.json")
}
}
# calling it.
get.data()
# [1] "Hitting Nobel API..."
# -> Assigning to a variable — 'nobels', 'nobelsPrize' and 'nobelsCountry', respectively.
nobels <- fromJSON(theData)
nobelsPrize <- fromJSON(thePrize)
nobelsCountry <- fromJSON(theCountry)
# -> Print the names attribute of the 'nobels', 'nobelsPrize' and 'nobelsCountry' data set.
names(nobels)
# [1] "laureates"
names(nobelsPrize)
# [1] "prizes"
names(nobelsCountry)
# [1] "countries"
# -> Primary observations
names(nobels$laureates)
# [1] "id" "firstname" "surname"
# [4] "born" "died" "bornCountry"
# [7] "bornCountryCode" "bornCity" "diedCountry"
# [10] "diedCountryCode" "diedCity" "gender"
# [13] "prizes"
names(nobels$laureates$prizes[[1]])
# [1] "year" "category" "share" "motivation"
# [5] "affiliations"
names(nobelsPrize$prizes)
# [1] "year" "category" "laureates"
# [4] "overallMotivation"
names(nobelsCountry$countries)
# [1] "name" "code"
Note: The analysis, that use prizes may count some laureates twice. However, there are only four such individuals, which makes little difference to these charts.
Now, I am going to work for the analysis of acquired data from the Nobel API:
We can retrieve those laureates who won more than one prize by selecting records, where nobels$laureates$prizes
has more than one row.
multi <- which(sapply(nobels$laureates$prizes, function(x) nrow(x)) > 1)
winners <- nobels$laureates[multi, c("firstname", "surname", "born", "bornCountry")]
write.table(winners, "./winners.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")
Result of Analysis is:-
Four individuals have won two prizes:
Counting up prizes by gender reveals the huge gender gap in Nobel Laureates.
gender <- as.data.frame(table(nobels$laureates$gender), stringsAsFactors = FALSE)
ggplot(gender) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue") +
theme_bw() +
labs(x = "Gender", y = "Count", title = "All Nobel Prizes by Gender")
Result of Analysis:-
Nobel Prize Awarded to Women!
This is an analysis of Nobel Laureates in Physics, Chemistry, Economic Sciences, Physiology or Medicine, Literature and Peace — by gender (male/female).
cnt <- sapply(nobels$laureates$prizes, function(x) nrow(x))
prizes <- ldply(nobels$laureates$prizes, as.data.frame)
prizes$id <- rep(nobels$laureates$id, cnt)
prizes$gender <- rep(nobels$laureates$gender, cnt)
pg <- as.data.frame(table(prizes$category, prizes$gender), stringsAsFactors = FALSE)
ggplot(pg) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue") +
theme_bw() +
facet_grid(Var2 ~ .) +
labs(x = "Category", y = "Count", title = "All Nobel Prizes by Gender and Category")
write.table(pg, "./prizes-gaps.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")
Is there any indication of an increase in female laureates over time?
p4 <- as.data.frame(table(prizes$year, prizes$gender), stringsAsFactors = FALSE)
colnames(p4) <- c("year", "gender", "Freq")
p4.1 <- mutate(group_by(p4, gender), cumsum = cumsum(Freq))
ggplot(subset(p4.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) +
theme_bw() +
scale_x_discrete(breaks = seq(1900, 2016, 10)) +
scale_color_manual(values = c("darkorange", "skyblue")) +
labs(x = "Year", y = "log(cumulative sum) of laureates",
title = "Cumulative Sum of Nobel Laureates by Gender over Time")
write.table(p4, "./gender-by-year.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")
Note: There is some indication that since about 1975, more women have won prizes than in the preceding years.
p5 <- as.data.frame(table(prizes$year, prizes$category, prizes$gender), stringsAsFactors = FALSE)
colnames(p5) <- c("year", "category", "gender", "Freq")
p5.1 <- mutate(group_by(p5, category, gender), cumsum = cumsum(Freq))
ggplot(subset(p5.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) +
facet_grid(category ~ .) +
theme_bw() +
scale_x_discrete(breaks = seq(1900, 2016, 10)) +
scale_color_manual(values = c("darkorange", "skyblue")) +
labs(x = "Year", y = "log(cumulative sum) of laureates",
title = "Cumulative Sum of Nobel Laureates by Gender & Category over Time")
write.table(p5, "./subset-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")
Conclusion of Gender Analysis: There is some indication that since about 1975, more women have won prizes in medicine and peace than in the preceding years. The rate of awards to women for literature also rises after about 1990.
To date, only one woman has won the prize for economics, two women have won for physics and four have won for chemistry.
How old are the laureates? The data does not include the date that prizes were awarded, so for those cases where birth date is available, we calculate age at the end of the year in which laureates won their prize. Median age is indicated by a point in this plot.
prizes$born <- rep(nobels$laureates$born, cnt)
prizes$age <- as.Date(paste(prizes$year, "12-31", sep = "-"), "%Y-%m-%d") - as.Date(prizes$born, "%Y-%m-%d")
ggplot(prizes[!is.na(prizes$category), ]) + geom_violin(aes(category, as.numeric(age) / 365), fill = "skyblue") +
theme_bw() +
stat_summary(aes(category, as.numeric(age) / 365), fun.y = "median", geom = "point") +
labs(x = "Category", y = "Age (years)", title = "Age Distribution of Nobel Laureates by Category")
Note: Median age is over 50 for all categories; physics laureates have the youngest median and economics the oldest. The peace prize is skewed by a recent very young “outlier” — Malala Yousafzai (Age-17).
Is there a change in age at which prizes were awarded over time?
ggplot(prizes[!is.na(prizes$category), ]) + geom_point(aes(year, as.numeric(age)/365)) +
theme_bw() +
geom_smooth(aes(year, as.numeric(age)/365, group = 1)) +
facet_wrap(~category) +
scale_x_discrete(breaks = seq(1900, 2016, 25)) +
labs(x = "Year", y = "Age(years) at end of year", title = "Age of Nobel Laureates Over Time by Category")
Note: There is a downward trend in age for the peace prize, again somewhat skewed by a young outlier. All other categories show an upward trend in age. This is especially pronounced for physics and chemistry, where laureates were much younger in the early part of the 20th century.
It is possible to calculate the number of prizes awarded by country of birth, but do the absolute numbers really tell us anything?
Clearly there should be some kind of correction as larger, wealthier countries might be expected to produce more laureates. However, this is not straightforward - what correction should be applied? Using current population, for example, might generate a bias towards very small countries with only one or two laureates. In addition, the characteristics of countries (population, GDP, borders) change over time. We will therefore simply present the absolute numbers and let readers draw their own conclusions regarding the “success” of individual countries. This chart uses ISO 3166 2-letter country codes.
prizes$country <- rep(nobels$laureates$bornCountryCode, cnt)
p6 <- as.data.frame(table(prizes$category, prizes$country), stringsAsFactors = FALSE)
ggplot(p6) + geom_bar(aes(Var2, Freq, fill = Var1), stat = "identity", position = "stack") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size = rel(0.82))) +
labs(x = "Country Code", y = "Count", title = "All Nobel Prizes by Country and Category") +
scale_fill_manual(values = c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494"),
name = "Category")
write.table(p6, "./countries-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")
prizes$country <- rep(nobels$laureates$bornCountryCode, cnt)
p8 <- as.data.frame(table(prizes$gender, prizes$country), stringsAsFactors = FALSE)
ggplot(p8) + geom_bar(aes(Var2, Freq, fill = Var1), stat = "identity", position = "stack") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size = rel(0.82))) +
labs(x = "Country Code", y = "Count", title = "All Nobel Prizes by Country and Gender") +
scale_fill_manual(values = c("#41b6c4", "#ffd800", "#253494"), name = "Gender")