Executive Summary

This advance report provides an analysis of Nobel Prize Winners in their fields since the very first Nobel Prize in 1901 to 2016. This document illustrates some ways to analyse data from the Nobel Prize API using the R programming language. The analysis includes multiple prize winners, gender gap, gender by category — “chemistry, economics, literature, medicine or physiology, peace, and physics”; gender over time, nobel prize share, distribution by age, and distribution by countries etc. Some adavnce analysis like — clustering, classifications, finite mixture models and principal component analysis, also included to get more inference related to Nobel Prize and it’s Laureates.

About Nobel Foundation

Thankfully, the Nobel Foundation & Nobelprize.org - is a registered trademark, and is produced, managed and maintained by Nobel Media, had created exactly the database of information for every Nobel Prize since 1901, including the Nobel Laureate’s biographies, Nobel Lectures, interviews, photos, articles, video clips, and press releases. Nobelprize.org provides comprehensive, first-hand information about the Nobel Prize and Nobel Laureates in Physics, Chemistry, Physiology or Medicine, Literature and Peace starting in 1901, as well as the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel and the Economics Laureates starting in 1969.

I would like to thanks, the ‘Nobel Foundation’ — had created exactly this dataset till this year — “2016 Nobel Prize Announcements”. Why, I have come to Nobel Foundation, because there wasn’t any other place that had the data as nicely structured and informative.

Again, The Nobel Foundation helped out by supplying a dataset with all of the Nobel Laureates from 1901 to 2016 and every effort to give Nobel Prizes for the greatest benefit to mankind.

Getting Started

Now, We have to load some required packages, to work on!

# Remove Objects from a Specified Environment.
rm(list = ls())

if (!require("caret") | !require("cluster") | !require("plyr") | !require("ggplot2") | !require("ggfortify") | !require("jsonlite") | !require("mixtools") | !require("stringi") | !require("dplyr")) {
  stop('Some required package(s) is not installed!')
} else {
  library("caret")      # 1. classification and regression Training.
  library("cluster")    # 2. "Finding Groups in Data": cluster analysis extended.
  library("plyr")       # 3. plyr is a Tools for splitting, applying and combining data.
  library("ggplot2")    # 4. ggplot2 is a plotting system for R, based on the grammar of graphics.
  library("ggfortify")  # 5. Data visualization tools for statistical analysis results.
  library("jsonlite")   # 6. jsonlite is a Robust, High Performance JSON Parser and Generator for R.
  library("mixtools")   # 7. mixtools is a tool for analyzing Finite Mixture Models.
  library("stringi")    # 8. string processing package for R.
  library("dplyr")      # 9. dplyr is a grammar of data manipulation.
  # -----> All are needed to do the analysis.
}
# Loading required package: caret
# Loading required package: lattice
# Loading required package: ggplot2
# Loading required package: cluster
# Loading required package: plyr
# Loading required package: ggfortify
# Loading required package: jsonlite
# Loading required package: mixtools
# mixtools package, version 1.0.4, Released 2016-01-11
# This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
# Loading required package: stringi
# Loading required package: dplyr
# 
# Attaching package: 'dplyr'
# The following objects are masked from 'package:plyr':
# 
#     arrange, count, desc, failwith, id, mutate, rename, summarise,
#     summarize
# The following objects are masked from 'package:stats':
# 
#     filter, lag
# The following objects are masked from 'package:base':
# 
#     intersect, setdiff, setequal, union

1.0 Basic Settings

options(scipen = 1)     # Turn off scientific notations for numbers.

1.1 Directory Creation

if (!dir.exists("./gold")) {
  dir.create("./gold")
}

2.0 Getting the Data

Nobelprize.org offers open data to developers in two ways: An API and as Linked Data. The data is free to use and contains information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award. The data is regularly updated as the information on Nobelprize.org is updated, including at the time of announcements of new Laureates. The REST based API described here provides different ways to list and search the data. The result is returned as JSON or CSV.

# -> We use the Nobel Prize API to fetch Nobel Laureates Data in JSON format.
theData <- "http://api.nobelprize.org/v1/laureate.json"
thePrize <- "http://api.nobelprize.org/v1/prize.json"
theCountry <- "http://api.nobelprize.org/v1/country.json"

2.1 Working with a ‘get.data’ function to get all the required data…

get.data <- function() {
  print("Hitting Nobel API...")
  if (!"laureate.json" %in% dir("./gold/")) {
    print("getting data for laureates, please wait...")
    download.file(theData, destfile = "gold/laureate.json")
  }
  if (!"prize.json" %in% dir("./gold/")) {
    print("getting data for prizes, please wait...")
    download.file(thePrize, destfile = "gold/prize.json")
  }
  if (!"country.json" %in% dir("./gold/")) {
    print("getting data for countries, please wait...")
    download.file(theCountry, destfile = "gold/country.json")
  }
}
# calling it.
get.data()
# [1] "Hitting Nobel API..."

# -> Assigning to a variable — 'nobels', 'nobelsPrize' and 'nobelsCountry', respectively.
nobels <- fromJSON(theData)
nobelsPrize <- fromJSON(thePrize)
nobelsCountry <- fromJSON(theCountry)

# -> Print the names attribute of the 'nobels', 'nobelsPrize' and 'nobelsCountry' data set.
names(nobels)
# [1] "laureates"
names(nobelsPrize)
# [1] "prizes"
names(nobelsCountry)
# [1] "countries"

# -> Primary observations
names(nobels$laureates)
#  [1] "id"              "firstname"       "surname"        
#  [4] "born"            "died"            "bornCountry"    
#  [7] "bornCountryCode" "bornCity"        "diedCountry"    
# [10] "diedCountryCode" "diedCity"        "gender"         
# [13] "prizes"
names(nobels$laureates$prizes[[1]])
# [1] "year"         "category"     "share"        "motivation"  
# [5] "affiliations"
names(nobelsPrize$prizes)
# [1] "year"              "category"          "laureates"        
# [4] "overallMotivation"
names(nobelsCountry$countries)
# [1] "name" "code"

Note: The analysis, that use prizes may count some laureates twice. However, there are only four such individuals, which makes little difference to these charts.

3. Analysis

Now, I am going to work for the analysis of acquired data from the Nobel API:

3.1 Multiple Prize Winners

We can retrieve those laureates who won more than one prize by selecting records, where nobels$laureates$prizes has more than one row.

multi <- which(sapply(nobels$laureates$prizes, function(x) nrow(x)) > 1)
winners <- nobels$laureates[multi, c("firstname", "surname", "born", "bornCountry")]
write.table(winners, "./winners.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

Result of Analysis is:-

Four individuals have won two prizes:

Marie Curie (physics, chemistry);
John Bardeen (physics);
Linus Pauling (chemistry, peace);
Frederick Sanger (chemistry).

3.2 Gender Gap

Counting up prizes by gender reveals the huge gender gap in Nobel Laureates.

gender <- as.data.frame(table(nobels$laureates$gender), stringsAsFactors = FALSE)
ggplot(gender) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue") + 
  theme_bw() + 
  labs(x = "Gender", y = "Count", title = "All Nobel Prizes by Gender")

Result of Analysis:-

Nobel Prize Awarded to Women!

The Nobel Prize and Prize in Economic Sciences have been awarded to women 49 times between 1901 and 2016.
Only one woman, Marie Curie, has been honoured twice, with the 1903 Nobel Prize in Physics and the 1911 Nobel Prize in Chemistry.
This means that 48 women in total have been awarded the Nobel Prize between 1901 and 2016.
23 are organizations and rest of laureates are male (839).

3.3 Gender by Category

This is an analysis of Nobel Laureates in Physics, Chemistry, Economic Sciences, Physiology or Medicine, Literature and Peace — by gender (male/female).

cnt <- sapply(nobels$laureates$prizes, function(x) nrow(x))
prizes <- ldply(nobels$laureates$prizes, as.data.frame)
prizes$id <- rep(nobels$laureates$id, cnt)
prizes$gender <- rep(nobels$laureates$gender, cnt)
pg <- as.data.frame(table(prizes$category, prizes$gender), stringsAsFactors = FALSE)
ggplot(pg) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue") + 
  theme_bw() + 
  facet_grid(Var2 ~ .) + 
  labs(x = "Category", y = "Count", title = "All Nobel Prizes by Gender and Category")

write.table(pg, "./prizes-gaps.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

3.4 Gender over Time

Is there any indication of an increase in female laureates over time?

p4 <- as.data.frame(table(prizes$year, prizes$gender), stringsAsFactors = FALSE)
colnames(p4) <- c("year", "gender", "Freq")
p4.1 <- mutate(group_by(p4, gender), cumsum = cumsum(Freq))
ggplot(subset(p4.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) +
  theme_bw() + 
  scale_x_discrete(breaks = seq(1900, 2016, 10)) + 
  scale_color_manual(values = c("darkorange", "skyblue")) + 
  labs(x = "Year", y = "log(cumulative sum) of laureates", 
       title = "Cumulative Sum of Nobel Laureates by Gender over Time")

write.table(p4, "./gender-by-year.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

Note: There is some indication that since about 1975, more women have won prizes than in the preceding years.

3.5 What if we subset by category?

p5 <- as.data.frame(table(prizes$year, prizes$category, prizes$gender), stringsAsFactors = FALSE)
colnames(p5) <- c("year", "category", "gender", "Freq")
p5.1 <- mutate(group_by(p5, category, gender), cumsum = cumsum(Freq))
ggplot(subset(p5.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) + 
  facet_grid(category ~ .) + 
  theme_bw() + 
  scale_x_discrete(breaks = seq(1900, 2016, 10)) + 
  scale_color_manual(values = c("darkorange", "skyblue")) + 
  labs(x = "Year", y = "log(cumulative sum) of laureates", 
       title = "Cumulative Sum of Nobel Laureates by Gender & Category over Time")

write.table(p5, "./subset-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

Conclusion of Gender Analysis: There is some indication that since about 1975, more women have won prizes in medicine and peace than in the preceding years. The rate of awards to women for literature also rises after about 1990.

To date, only one woman has won the prize for economics, two women have won for physics and four have won for chemistry.

3.6 Nobel Prize Share

Nobel foundation sometimes giving or distributing prizes between more than one laureates. Prizes may be shared by no more than three people. How often has this occurred?

share <- as.data.frame(table(prizes$year, prizes$category), stringsAsFactors = FALSE)
colnames(share) <- c("year", "category", "Freq")
shared <- as.data.frame(table(share$Freq), stringsAsFactors = FALSE)
ggplot(shared[shared$Var1 != 0, ]) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue") + 
  theme_bw() + 
  labs(x = "Number of laureates", y = "Count", title = "Laureates per Nobel Prize")

write.table(share, "./shared-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

Note: Individual winners are most common.

3.7 Nobel Prize Share by Category or Fields

Are there any notable differences in prize sharing between fields?

sharecat <- as.data.frame(table(share$category, share$Freq), stringsAsFactors = FALSE)
colnames(sharecat) <- c("category", "share", "Freq")
ggplot(subset(sharecat, share > 0)) + geom_bar(aes(category, Freq), stat = "identity", fill = "skyblue") + 
  theme_bw() + 
  facet_grid(share ~ .) + labs(x = "Category", y = "Count", title = "Laureates per Nobel Prize by Category")

write.table(sharecat, "./counted-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

Note: Individual winners are more common in all categories, notably literature and peace. In the sciences, two or three winners are roughly equally common; chemistry stands out with more individual winners than medicine or physics. Literature laureates are maximum on the basis of individual winners.

3.8 Ages and Nobel Laureates

How old are the laureates? The data does not include the date that prizes were awarded, so for those cases where birth date is available, we calculate age at the end of the year in which laureates won their prize. Median age is indicated by a point in this plot.

3.8.1 Ages Distribution by Category

prizes$born <- rep(nobels$laureates$born, cnt)
prizes$age <- as.Date(paste(prizes$year, "12-31", sep = "-"), "%Y-%m-%d") - as.Date(prizes$born, "%Y-%m-%d")
ggplot(prizes[!is.na(prizes$category), ]) + geom_violin(aes(category, as.numeric(age) / 365), fill = "skyblue") + 
  theme_bw() + 
  stat_summary(aes(category, as.numeric(age) / 365), fun.y = "median", geom = "point") + 
  labs(x = "Category", y = "Age (years)", title = "Age Distribution of Nobel Laureates by Category")

Note: Median age is over 50 for all categories; physics laureates have the youngest median and economics the oldest. The peace prize is skewed by a recent very young “outlier” — Malala Yousafzai (Age-17).

3.8.2 Ages Distribution by Year Awarded

Is there a change in age at which prizes were awarded over time?

ggplot(prizes[!is.na(prizes$category), ]) + geom_point(aes(year, as.numeric(age)/365)) + 
  theme_bw() + 
  geom_smooth(aes(year, as.numeric(age)/365, group = 1)) + 
  facet_wrap(~category) + 
  scale_x_discrete(breaks = seq(1900, 2016, 25)) + 
  labs(x = "Year", y = "Age(years) at end of year", title = "Age of Nobel Laureates Over Time by Category")

Note: There is a downward trend in age for the peace prize, again somewhat skewed by a young outlier. All other categories show an upward trend in age. This is especially pronounced for physics and chemistry, where laureates were much younger in the early part of the 20th century.

3.9 Countries Wise Distribution Analysis

3.9.1 Countries by Category

It is possible to calculate the number of prizes awarded by country of birth, but do the absolute numbers really tell us anything?

Clearly there should be some kind of correction as larger, wealthier countries might be expected to produce more laureates. However, this is not straightforward - what correction should be applied? Using current population, for example, might generate a bias towards very small countries with only one or two laureates. In addition, the characteristics of countries (population, GDP, borders) change over time. We will therefore simply present the absolute numbers and let readers draw their own conclusions regarding the “success” of individual countries. This chart uses ISO 3166 2-letter country codes.

prizes$country <- rep(nobels$laureates$bornCountryCode, cnt)
p6 <- as.data.frame(table(prizes$category, prizes$country), stringsAsFactors = FALSE)
ggplot(p6) + geom_bar(aes(Var2, Freq, fill = Var1), stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, size = rel(0.82))) + 
  labs(x = "Country Code", y = "Count", title = "All Nobel Prizes by Country and Category") + 
  scale_fill_manual(values = c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494"),
                    name = "Category")

write.table(p6, "./countries-by-category.txt", sep="\t", row.names = FALSE, fileEncoding = "UTF-8")

3.9.2 Countries by Gender or Organisation

prizes$country <- rep(nobels$laureates$bornCountryCode, cnt)
p8 <- as.data.frame(table(prizes$gender, prizes$country), stringsAsFactors = FALSE)
ggplot(p8) + geom_bar(aes(Var2, Freq, fill = Var1), stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, size = rel(0.82))) + 
  labs(x = "Country Code", y = "Count", title = "All Nobel Prizes by Country and Gender") + 
  scale_fill_manual(values = c("#41b6c4", "#ffd800", "#253494"), name = "Gender")

Nobel Data Analysis and Modelling: An Advance Report

Prabhat Kumar — CEO, Founder & Data Scientist. http://prabhatkumar.org/

10/16/2016