Executive Summary

This report analyses Nobel Prize winners across all fields since the very first Nobel Prize in 1901. It illustrates some ways to analyse data from the Nobel Prize API using the R programming language. The analysis covers multiple prize winners, the gender gap, gender by category (Chemistry, Economics, Literature, Physiology or Medicine, Peace, Physics), gender over time, Nobel Prize share, distribution by age, and distribution by country.

About Nobel Foundation

Thankfully, the Nobel Foundation and Nobelprize.org (a registered trademark, produced, managed and maintained by Nobel Media) have created exactly the database needed, with information on every Nobel Prize since 1901, including the Nobel Laureates’ biographies, Nobel Lectures, interviews, photos, articles, video clips, and press releases. Nobelprize.org provides comprehensive, first-hand information about the Nobel Prize and Nobel Laureates in Physics, Chemistry, Physiology or Medicine, Literature and Peace starting in 1901, as well as the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel and the Economics Laureates starting in 1969.

I would like to thank the Nobel Foundation, which had created exactly this dataset right before the 2016 Nobel Prize announcements. I turned to the Nobel Foundation because no other place had the data as nicely structured and informative.

After the 2016 Nobel Prize announcements, I “only” needed to add the prizes for 2016. Again the Nobel Foundation helped out by supplying a dataset with all of the Nobel Laureates from 1901.

Getting Started

Now we load the packages required for the analysis.

# Attach the required packages and stop early if any of them is missing.
# require() attaches a package and returns FALSE when it is not installed.
if (!require("jsonlite") || !require("ggplot2") || !require("plyr") || !require("dplyr") || !require("xtable")) {
  stop('Some required package(s) are not installed!')
} else {
  library("jsonlite")
  library("ggplot2")
  library("plyr")
  library("dplyr")
  library("xtable")
}
## Loading required package: jsonlite
## Warning: package 'jsonlite' was built under R version 3.2.5
## Loading required package: ggplot2
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.5
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.2.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: xtable

1. Data Collection

Before I even started searching online, I already had an idea of the data I wanted to play with this month. Thanks to the Nobel Foundation, which has developed a nice API, I could get it from http://api.nobelprize.org/.

theData <- "http://api.nobelprize.org/v1/laureate.json"
nobels <- fromJSON(theData)
names(nobels)
## [1] "laureates"
names(nobels$laureates)
##  [1] "id"              "firstname"       "surname"        
##  [4] "born"            "died"            "bornCountry"    
##  [7] "bornCountryCode" "bornCity"        "diedCountry"    
## [10] "diedCountryCode" "diedCity"        "gender"         
## [13] "prizes"
names(nobels$laureates$prizes[[1]])
## [1] "year"         "category"     "share"        "motivation"  
## [5] "affiliations"

The variable nobels is a list with one named element, laureates. The element laureates is a data frame with 13 columns, one row per laureate. The last column, prizes, is a list of data frames, one per laureate.
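To make this structure concrete, here is a minimal sketch using a small made-up data frame. The name toy_laureates and all of its values are hypothetical and are not taken from the API; they only mirror the shape of nobels$laureates, with a list column holding one data frame per row.

```r
# Toy stand-in for nobels$laureates: all values are invented for illustration.
toy_laureates <- data.frame(
  id = c("1", "2", "3"),
  surname = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

# A list column: each element is a data frame of that laureate's prizes.
toy_laureates$prizes <- list(
  data.frame(year = "1950", category = "physics", stringsAsFactors = FALSE),
  data.frame(year = c("1960", "1970"), category = c("chemistry", "peace"),
             stringsAsFactors = FALSE),
  data.frame(year = "1980", category = "literature", stringsAsFactors = FALSE)
)

# Number of prizes per laureate: one nrow() call per list element.
prize_counts <- sapply(toy_laureates$prizes, nrow)
print(prize_counts)

# Surnames of laureates with more than one prize.
multi_winners <- toy_laureates$surname[prize_counts > 1]
print(multi_winners)
```

Counting rows per list element in this way is the same idea the analysis relies on whenever it needs to know how many prizes a laureate holds.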

Note that analyses that use prizes may count some laureates more than once. However, only four individuals and two organizations have won multiple prizes, which makes little difference to these charts.

2. Analysis

Now I will analyse the data acquired from the Nobel API:

2.1 Multiple Prize Winners

We can retrieve the laureates who won more than one prize by selecting the records where nobels$laureates$prizes has more than one row.

multi <- which(sapply(nobels$laureates$prizes, function(x) nrow(x)) > 1)
winners <- nobels$laureates[multi, c("firstname", "surname", "born", "bornCountry")]
print(xtable(winners), type = "html", comment = FALSE, include.rownames = FALSE)
firstname | surname | born | bornCountry
Marie | Curie, née Sklodowska | 1867-11-07 | Russian Empire (now Poland)
John | Bardeen | 1908-05-23 | USA
Linus Carl | Pauling | 1901-02-28 | USA
Frederick | Sanger | 1918-08-13 | United Kingdom
Comité international de la Croix Rouge (International Committee of the Red Cross) | | 0000-00-00 |
Office of the United Nations High Commissioner for Refugees (UNHCR) | | 0000-00-00 |
# Result of Analysis.
# Four individuals have won two prizes:
#  1. Marie Curie (physics, chemistry);
#  2. John Bardeen (physics, twice);
#  3. Linus Pauling (chemistry, peace);
#  4. Frederick Sanger (chemistry, twice).
# Two organizations, the International Committee of the Red Cross and UNHCR, have also won more than once.

2.2 Gender

Counting up laureates by gender reveals the huge gender gap among Nobel Laureates.

gender <- as.data.frame(table(nobels$laureates$gender), stringsAsFactors = FALSE)
ggplot(gender) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue3") +
  theme_bw() +
  labs(x = "Gender", y = "Count", title = "All Nobel Prizes by Gender")

# Result of Analysis.
# Nobel Prizes awarded to women:
# The Nobel Prize and the Prize in Economic Sciences were awarded to women 49 times between 1901 and 2015.
# Only one woman, Marie Curie, has been honoured twice,
# with the 1903 Nobel Prize in Physics and the 1911 Nobel Prize in Chemistry.
# This means that 48 women in total were awarded the Nobel Prize between 1901 and 2015.
# 26 laureates are organizations and the rest are male.

2.2.1 Gender by Category

Categories: Chemistry, Economics, Literature, Physiology or Medicine, Peace, Physics.

cnt <- sapply(nobels$laureates$prizes, function(x) nrow(x))
prizes <- ldply(nobels$laureates$prizes, as.data.frame)
prizes$id <- rep(nobels$laureates$id, cnt)
prizes$gender <- rep(nobels$laureates$gender, cnt)
pg <- as.data.frame(table(prizes$category, prizes$gender), stringsAsFactors = FALSE)
ggplot(pg) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue3") +
  theme_bw() +
  facet_grid(Var2 ~ .) + labs(x = "Category", y = "Count", title = "All Nobel Prizes by Gender and Category")
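
The flattening step above can be sketched on toy data. This version is only an illustration: the data are invented, the names (ids, genders, prize_list, flat) are hypothetical, and do.call(rbind, ...) from base R is used as a stand-in for plyr's ldply so that the sketch runs without extra packages.

```r
# Hypothetical toy data (not from the API): two laureates, the second with two prizes.
ids <- c("1", "2")
genders <- c("female", "male")
prize_list <- list(
  data.frame(year = "1950", category = "physics", stringsAsFactors = FALSE),
  data.frame(year = c("1960", "1970"), category = c("chemistry", "peace"),
             stringsAsFactors = FALSE)
)

# Rows per laureate, used to repeat laureate-level fields once per prize.
n_prizes <- sapply(prize_list, nrow)

# Stack the per-laureate data frames into one data frame, one row per prize.
flat <- do.call(rbind, prize_list)
flat$id <- rep(ids, n_prizes)
flat$gender <- rep(genders, n_prizes)
print(flat)

# Cross-tabulate category by gender in long format, the shape used for a faceted bar chart.
counts <- as.data.frame(table(flat$category, flat$gender), stringsAsFactors = FALSE)
print(counts)

# Laureate "2" now occupies two rows, which is the double counting noted earlier.
n_rows_per_id <- table(flat$id)
print(n_rows_per_id)
```

Because rep() repeats each laureate's id and gender once per prize, the flattened table has one row per prize rather than one per laureate, so counts taken from it are counts of prizes.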

2.2.2 Gender over Time

Is there any indication of an increase in female laureates over time?

p5 <- as.data.frame(table(prizes$year, prizes$gender), stringsAsFactors = FALSE)
colnames(p5) <- c("year", "gender", "Freq")
p5.1 <- mutate(group_by(p5, gender), cumsum = cumsum(Freq))
ggplot(subset(p5.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) +
  theme_bw() +
  scale_x_discrete(breaks = seq(1900, 2015, 10)) +
  scale_color_manual(values = c("darkorange", "skyblue3")) +
  labs(x = "Year", y = "log(cumulative sum) of laureates", title = "Cumulative Sum of Nobel Laureates by Gender over Time")

Note: there is some indication that, since about 1975, women have won prizes at a higher rate than in the preceding years.
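
The chart plots the log of a running total computed separately for each gender. As a rough sketch of that computation, the example below uses made-up yearly counts (not real Nobel data) and base R's ave() in place of the dplyr group_by/mutate pair, so it runs without extra packages. The data frame name yearly is hypothetical.

```r
# Made-up yearly counts for illustration; not real Nobel data.
yearly <- data.frame(
  year = rep(c("1901", "1902", "1903"), times = 2),
  gender = rep(c("female", "male"), each = 3),
  Freq = c(0, 1, 1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Running total within each gender; rows must already be in year order.
yearly$cumsum <- ave(yearly$Freq, yearly$gender, FUN = cumsum)

# log() of a running total that is still zero is -Inf in R.
yearly$log_cumsum <- log(yearly$cumsum)
print(yearly)
```

On a log scale, a steeper slope means a faster relative growth in the running total, which is why the curve for women is worth comparing before and after 1975.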

What if we subset by category?

p6 <- as.data.frame(table(prizes$year, prizes$category, prizes$gender), stringsAsFactors = FALSE)
colnames(p6) <- c("year", "category", "gender", "Freq")
p6.1 <- mutate(group_by(p6, category, gender), cumsum = cumsum(Freq))
ggplot(subset(p6.1, gender != "org")) + geom_point(aes(year, log(cumsum), color = gender)) +
  facet_grid(category ~ .) +
  theme_bw() +
  scale_x_discrete(breaks = seq(1900, 2015, 10)) +
  scale_color_manual(values = c("darkorange", "skyblue3")) +
  labs(x = "Year", y = "log(cumulative sum) of laureates",
       title = "Cumulative Sum of Nobel Laureates by Gender and Category over Time")

Conclusion of Gender Analysis: there is some indication that, since about 1975, more women have won prizes in medicine and peace than in the preceding years. The rate of awards to women for literature also rises after about 1990.

To date, only one woman has won the prize for economics, two women have won for physics and four have won for chemistry.

3. Nobel Prize Share

Prizes may be shared by no more than three people. How often has this occurred?

share <- as.data.frame(table(prizes$year, prizes$category), stringsAsFactors = FALSE)
colnames(share) <- c("year", "category", "Freq")
shared <- as.data.frame(table(share$Freq), stringsAsFactors = FALSE)
ggplot(shared[shared$Var1 != 0, ]) + geom_bar(aes(Var1, Freq), stat = "identity", fill = "skyblue3") +
  theme_bw() +
  labs(x = "Number of laureates", y = "Count", title = "Laureates per Nobel Prize")

Note: individual winners are most common.
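
The counting logic can be sketched on a few invented records. The data frame toy_prizes and its values are hypothetical; the point is only to show how a year-by-category table yields the number of laureates per prize, and why zero counts are filtered out.

```r
# Hypothetical flattened prize records: one row per laureate per prize.
toy_prizes <- data.frame(
  year = c("1950", "1950", "1950", "1951", "1951"),
  category = c("physics", "physics", "peace", "physics", "physics"),
  stringsAsFactors = FALSE
)

# Laureates per (year, category) cell; a cell with no award gets Freq 0.
per_prize <- as.data.frame(table(toy_prizes$year, toy_prizes$category),
                           stringsAsFactors = FALSE)
colnames(per_prize) <- c("year", "category", "Freq")
print(per_prize)

# How many prizes had 1, 2, ... laureates; drop the "0" group of empty cells.
shared_toy <- as.data.frame(table(per_prize$Freq), stringsAsFactors = FALSE)
shared_nonzero <- shared_toy[shared_toy$Var1 != "0", ]
print(shared_nonzero)
```

The zero cells correspond to year and category combinations in which no prize appears in the data, which is why they are excluded before plotting.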

3.1 Nobel Prize Share by Category

Are there any notable differences in prize sharing between fields?

sharecat <- as.data.frame(table(share$category, share$Freq), stringsAsFactors = FALSE)
colnames(sharecat) <- c("category", "share", "Freq")
ggplot(subset(sharecat, share > 0)) + geom_bar(aes(category, Freq), stat = "identity", fill = "skyblue3") +
  theme_bw() +
  facet_grid(share ~ .) + labs(x = "Category", y = "Count", title = "Laureates per Nobel Prize by Category")

Note: individual winners are the most common outcome in all categories, notably literature and peace. In the sciences, two or three winners are roughly equally common; chemistry stands out with more individual winners than medicine or physics.

Source Code: in R.

Thanks

My heartfelt thanks go to the Nobel Foundation for building the public API that made it possible to retrieve and analyse the Nobel Laureates data. Thank you so much.

Author — Prabhat Kumar | +91-8548-881059 | prabhat.genome@gmail.com.