1. The most Nobel of Prizes
The Nobel Prize is perhaps the world’s best-known scientific award. Apart from the honor, prestige, and substantial prize money, the recipient also gets a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Every year it’s given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the Prize was very Eurocentric and male-focused, but nowadays it’s not biased in any way whatsoever. Surely. Right?
Well, we’re going to find out! The Nobel Foundation has made a dataset available of all prize winners from the start of the prize, in 1901, to 2016. Let’s load it in and take a look.
Task 1: Instructions
Load the required libraries and the Nobel Prize dataset.
- Load the tidyverse library.
- Use read_csv (not read.csv) to read in datasets/nobel.csv and save it into nobel.
- Show the head of nobel, that is, the first couple of prize winners.
Make sure to use read_csv (with an underscore) to read in the data. The read.csv function, which is built into R, has a number of problems which the new read_csv function avoids.
Good to know
This Project assumes you have used the dplyr and ggplot2 packages and that you are familiar with the pipe operator (%>%). Before taking on this Project, we recommend that you have completed the course Introduction to the Tidyverse.
RStudio has created some very helpful cheat sheets, including two that will be helpful for this Project: Data Wrangling and Data Visualization with ggplot2. We recommend that you keep them open in a separate tab to make it easy to refer to them.
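For Task 1, the loading step might look something like the sketch below. Since datasets/nobel.csv is not reproduced here, the sketch first writes a tiny stand-in file; the three rows, the reduced set of columns, and the name nobel_demo are illustrative assumptions, not the real dataset.

```r
# readr is one of the packages attached by library(tidyverse);
# it is loaded on its own here to keep the sketch light.
library(readr)

# Hypothetical stand-in for datasets/nobel.csv (made-up subset of columns)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,category,sex,birth_country",
  "1901,Chemistry,Male,Netherlands",
  "1901,Physics,Male,Prussia (Germany)",
  "1903,Physics,Female,Russian Empire (Poland)"
), csv_path)

# read_csv() returns a tibble and guesses the column types;
# col_types = cols() keeps the guessing but silences the message
nobel_demo <- read_csv(csv_path, col_types = cols())

# Show the first rows of the table
head(nobel_demo)
```

With the real file, the same call would simply point at datasets/nobel.csv instead of the temporary path.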
2. So, who gets the Nobel Prize?
Just looking at the first couple of prize winners, or Nobel laureates as they are also called, we already see a celebrity: Wilhelm Conrad Röntgen, the guy who discovered X-rays. And actually, we see that all of the winners in 1901 were guys that came from Europe. But that was back in 1901. Looking at all winners in the dataset, from 1901 to 2016, which sex and which country is the most commonly represented? (For country, we will use the birth_country of the winner, as the organization_country is NA for all shared Nobel Prizes.)
Task 2: Instructions
Count up the Nobel Prizes. Also, split by sex and birth_country.
- Count and display the number of rows/prizes using the count() function.
- Count and display the number of rows/prizes, grouped by sex.
- Count the number of rows/prizes, grouped by birth_country. Arrange the result by number of prizes in descending order and display the first 20 rows using head(20).
For how to use the group_by function to group by a column, check out Group Cases in the dplyr cheat sheet. For how to arrange rows, take a look at Arrange Cases in the same cheat sheet.
# Counting the number of (possibly shared) Nobel Prizes handed
# out between 1901 and 2016
nobel %>% count()
# Counting the number of prizes won by male and female recipients
nobel %>%
  count(sex)
# Counting the number of prizes by the winners' country of birth
nobel %>%
  group_by(birth_country) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(20)
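As a side note on the counting step, count() can also take the grouping column directly and sort the result itself. The sketch below shows this on a made-up table; toy_nobel and its four rows are assumptions for illustration only.

```r
library(dplyr)

# Made-up rows standing in for the nobel table
toy_nobel <- tibble(
  sex           = c("Male", "Male", "Female", NA),
  birth_country = c("France", "Sweden", "France", NA)
)

# count() with no columns returns a single row holding the row count
toy_nobel %>% count()

# count(col) is roughly shorthand for group_by(col) followed by a tally;
# NA values are kept as a group of their own
by_sex <- toy_nobel %>% count(sex)
by_sex

# sort = TRUE orders the result by n, largest first
by_country <- toy_nobel %>% count(birth_country, sort = TRUE)
by_country
```

The NA group matters for this dataset because rows without a recorded sex or birth country would otherwise be easy to overlook.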
3. USA dominance
Not so surprising perhaps: the most common Nobel laureate between 1901 and 2016 was a man born in the United States of America. But in 1901 all the laureates were European. When did the USA start to dominate the Nobel Prize charts?
Task 3: Instructions
Calculate the proportion of USA born winners per decade starting from the nobel dataset and put the result into prop_usa_winners.
- Add a usa_born_winner column to nobel, where the value is TRUE when birth_country is "United States of America".
- Add a decade column to nobel showing the decade the prize was awarded (1953 should become 1950, for example).
- Group by decade and use summarize to create the column proportion. proportion should contain the proportion of usa_born_winner values that are TRUE for each decade.
- Display / print out prop_usa_winners.
You can use mutate for the first two bullet points.
To calculate the proportion of TRUE values you can use the mean function. If the column includes NA values, you would have to use the na.rm argument. Here’s how you could use mean together with summarize:
my_data %>%
  group_by(my_categorical_variable) %>%
  summarize(proportion = mean(is_winner, na.rm = TRUE))
# Calculating the proportion of USA born winners per decade
prop_usa_winners <- nobel %>%
  mutate(usa_born_winner = birth_country == "United States of America",
         decade = floor(year / 10) * 10) %>%
  group_by(decade) %>%
  summarize(proportion = mean(usa_born_winner, na.rm = TRUE))
# Display the proportions of USA born winners per decade
prop_usa_winners
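The two building blocks of this calculation, rounding a year down to its decade and taking the mean of a logical vector, can be checked in isolation. This is a small base R sketch; the vectors years and birth_country are made up for illustration.

```r
# Made-up award years
years <- c(1901, 1953, 1959, 2016)

# Divide by 10, drop the fraction, multiply back: 1953 -> 195.3 -> 195 -> 1950
decades <- floor(years / 10) * 10
decades                      # 1900 1950 1950 2010

# Integer division gives the same result
decades_alt <- years %/% 10 * 10

# Made-up birth countries; the NA mimics a row with no birth country
birth_country <- c("United States of America", "France", NA)

# Comparing NA with a string yields NA, not FALSE
usa_born <- birth_country == "United States of America"
usa_born                     # TRUE FALSE NA

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE
mean(usa_born)               # NA, because of the missing value
prop <- mean(usa_born, na.rm = TRUE)
prop                         # 0.5
```

Note that na.rm = TRUE drops the missing rows from the denominator as well, so the proportion is taken over winners with a known birth country only.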
4. USA dominance, visualized
A table is OK, but to see when the USA started to dominate the Nobel charts we need a plot!
Task 4: Instructions
Plot the proportion of USA born winners per decade.
- Use ggplot to plot prop_usa_winners with decade on the x-axis and proportion on the y-axis as a line-and-dot-plot. That is, add both geom_line() and geom_point().
- Fix the y-scale so that it shows percentages, its limits go from 0.0 to 1.0, and extra spacing is removed above and below 0.0 and 1.0.
To change the y-axis use scale_y_continuous and set the labels, limits, and expand arguments. Check the ggplot2 documentation for how to use limits and expand. Here is a StackOverflow question that shows how to set labels correctly.
# Setting the size of plots in this notebook
options(repr.plot.width = 7, repr.plot.height = 4)
# Plotting USA born winners
ggplot(data = prop_usa_winners, aes(x = decade, y = proportion)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent,
                     limits = 0:1, expand = c(0, 0))
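To see what each scale_y_continuous argument contributes, here is the same kind of plot on a tiny made-up table, with one comment per argument. The data frame toy_props and its values are assumptions for illustration.

```r
library(ggplot2)

# Made-up proportions, one per decade
toy_props <- data.frame(
  decade     = c(1900, 1910, 1920),
  proportion = c(0.02, 0.08, 0.07)
)

p <- ggplot(toy_props, aes(x = decade, y = proportion)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(
    labels = scales::percent,  # show 0.25 as a percentage instead of a fraction
    limits = c(0, 1),          # fix the axis to the full 0 to 1 range
    expand = c(0, 0)           # no extra padding beyond the limits
  )

p  # printing the object draws the plot
```

Fixing the limits to the full range makes plots for different groups directly comparable, at the cost of leaving empty space when all proportions are small.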

5. What is the gender of a typical Nobel Prize winner?
So the USA became the dominating winner of the Nobel Prize first in the 1930s and has kept the leading position ever since. But one group that was in the lead from the start, and never seems to let go, are men. Maybe it shouldn’t come as a shock that there is some imbalance between how many male and female prize winners there are, but how significant is this imbalance? And is it better or worse within specific prize categories like physics, medicine, literature, etc.?
Task 5: Instructions
Plot the proportion of female laureates by decade split by prize category.
- Add the female_winner column, where the value is TRUE when sex is "Female".
- Add the column decade showing the decade the prize was awarded (1953 should become 1950, for example).
- Group by decade and category and summarize the proportion of female_winner into the proportion column.
- Copy and paste your ggplot code from task 4, except plot the prop_female_winners data and map the category variable to the color parameter.
This task can be solved by copying and modifying the code from task 3 and 4.
# Calculating the proportion of female laureates per decade
prop_female_winners <- nobel %>%
  mutate(female_winner = sex == "Female",
         decade = floor(year / 10) * 10) %>%
  group_by(decade, category) %>%
  summarize(proportion = mean(female_winner, na.rm = TRUE))
# Plotting the proportion of female laureates per decade
ggplot(data = prop_female_winners, aes(x = decade, y = proportion, color = category)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent,
                     limits = 0:1, expand = c(0, 0))

6. The first woman to win the Nobel Prize
The plot above is a bit messy as the lines are overplotting. But it does show some interesting trends and patterns. Overall the imbalance is pretty large with physics, economics, and chemistry having the largest imbalance. Medicine has a somewhat positive trend, and since the 1990s the literature prize is also now more balanced. The big outlier is the peace prize during the 2010s, but keep in mind that this just covers the years 2010 to 2016.
Given this imbalance, who was the first woman to receive a Nobel Prize? And in what category?
Task 6: Instructions
Extract and display the row showing the first woman to win a Nobel Prize.
- Use filter to filter away all non-"Female" laureates.
- Use top_n to pick out the row with the earliest year.
top_n(x, n, wt) is a useful function that takes a table x and picks out the top n rows as ordered by the column wt. By default top_n sorts highest-to-lowest, so to pick out the five best offers in the bargain bin, you would have to use desc():
bargain_bin %>% top_n(5, desc(price))
nobel %>%
  filter(sex == "Female") %>%
  top_n(1, desc(year))
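In dplyr 1.0.0 and later, top_n() is documented as superseded by slice_min() and slice_max(), which state the direction in the function name. The sketch below compares the two on a made-up bargain bin, reusing the name from the hint above; the items and prices are assumptions for illustration.

```r
library(dplyr)

# Hypothetical bargain bin
bargain_bin <- tibble(
  item  = c("mug", "lamp", "book", "pen"),
  price = c(4, 12, 7, 1)
)

# top_n() ranks highest-first, so desc() is needed to get the cheapest rows;
# the matching rows come back in their original order
cheapest_top_n <- bargain_bin %>% top_n(2, desc(price))
cheapest_top_n

# slice_min() asks for the smallest values directly and sorts the result
cheapest_slice <- bargain_bin %>% slice_min(price, n = 2)
cheapest_slice
```

Both functions keep ties by default, so asking for one row can return several if they share the same value.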
7. Repeat laureates
For most scientists/writers/activists a Nobel Prize would be the crowning achievement of a long career. But for some people, one is just not enough, and there are a few who have gotten it more than once. Who are these lucky few? (Having won no Nobel Prize myself, I’ll assume it’s just about luck.)
Task 7: Instructions
Extract and display the names of repeat Nobel Prize winners.
- Use count to count the number of wins grouped by full_name.
- Filter away all winners that “only” won one time.
# Selecting the laureates that have received 2 or more prizes.
nobel %>%
  group_by(full_name) %>%
  count() %>%
  filter(n > 1) %>%
  arrange(desc(n))
8. How old are you when you get the prize?
The list of repeat winners contains some illustrious names! We again meet Marie Curie, who got the prize in physics for her research on radioactivity and in chemistry for isolating radium and polonium. John Bardeen got it twice in physics for transistors and superconductivity, Frederick Sanger got it twice in chemistry, and Linus Carl Pauling got it first in chemistry and later in peace for his work in promoting nuclear disarmament. We also learn that organizations can get the prize: the UNHCR has gotten it twice and the Red Cross three times.
But how old are you generally when you get the prize?
Task 8: Instructions
Calculate and plot the age of each winner when they won their Nobel Prize.
- Load the lubridate package (you’ll find the year() function useful).
- mutate the nobel table to include the column age which should be how old people were when they got their prize. Assign the resulting table to nobel_age.
- Use ggplot to plot age as a function of year as a scatter plot (geom_point()) with a smooth trend (geom_smooth()).
The year() function from lubridate takes a date and extracts the year:
dates <- as.Date(c("1985-04-02", "1988-07-25"))
year(dates)
## [1] 1985 1988
# Loading the lubridate package
library(lubridate)
package ‘lubridate’ was built under R version 3.5.1
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
# Calculating the age of Nobel Prize winners
nobel_age <- nobel %>%
  mutate(age = year - year(birth_date))
# Plotting the age of Nobel Prize winners
ggplot(data = nobel_age, aes(x = year, y = age)) +
  geom_point() +
  geom_smooth()
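The age calculation subtracts calendar years only, which is worth keeping in mind when reading the plot. The sketch below runs the same subtraction on made-up values; award_year and birth_date are assumptions for illustration, with the NA standing in for a row that has no birth date.

```r
library(lubridate)

# Made-up award years and birth dates
award_year <- c(1950, 2000, 1963)
birth_date <- as.Date(c("1900-12-31", "1940-01-01", NA))

# year() pulls out the calendar year; a missing date stays missing
age <- award_year - year(birth_date)
age  # 50 60 NA
```

Because month and day are ignored, someone born on 31 December 1900 is counted as 50 for a 1950 prize even if the award came before their birthday, so ages can be overstated by up to a year. Rows with NA ages are dropped from the scatter plot with a warning.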

9. Age differences between prize categories
The plot above shows us a lot! We see that people used to be around 55 when they received the prize, but nowadays the average is closer to 65. But there is a large spread in the laureates’ ages, and while most are 50+, some are very young.
We also see that the density of points is much higher nowadays than in the early 1900s: many more of the prizes are now shared, and so there are many more winners. We also see that there was a disruption in awarded prizes around the Second World War (1939 - 1945).
Let’s look at age trends within different prize categories.
Task 9: Instructions
Plot how old winners are within the different prize categories.
- Use ggplot to plot age as a function of year as a scatter plot (geom_point()) with a smooth trend (geom_smooth()) and facet by category using facet_wrap.
- Optional: Remove the confidence band in geom_smooth by setting se = FALSE.
This is the same plot as in task 8, except faceted by category.
Removing the confidence band in geom_smooth is not strictly necessary, but the bands are not that meaningful and removing them makes the plot more focused.
If you don’t remember how facet_wrap works then look under Faceting on the second page of the ggplot2 cheat sheet.
# Same plot as above, but faceted by the category of the Nobel Prize
ggplot(data = nobel_age, aes(x = year, y = age)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~ category)

10. Oldest and youngest winners
Another plot with lots of exciting stuff going on! We see that winners of the chemistry, medicine, and physics prizes have gotten older over time. The trend is strongest for physics: the average age used to be below 50, and now it’s almost 70. Literature and economics are more stable, and we also see that economics is a newer category. But peace shows an opposite trend where winners are getting younger!
In the peace category we also see a winner around 2010 that seems exceptionally young. This raises the question: who are the oldest and youngest people ever to have won a Nobel Prize?
Task 10: Instructions
Pick out the rows of the oldest and the youngest winner of a Nobel Prize.
- Use top_n to pick out and display the row of the oldest winner.
- Use top_n to pick out and display the row of the youngest winner. Remember that you can use desc to reverse the sorting:
# The most expensive item
bargain_bin %>% top_n(1, price)
# The cheapest item
bargain_bin %>% top_n(1, desc(price))
# The oldest winner of a Nobel Prize as of 2016
nobel_age %>% top_n(1, age)
# The youngest winner of a Nobel Prize as of 2016
nobel_age %>% top_n(1, desc(age))
11. You get a prize!
Hey! You get a prize for making it to the very end of this notebook! It might not be a Nobel Prize, but I made it myself in paint so it should count for something. But don’t despair, Leonid Hurwicz was 90 years old when he got his prize, so it might not be too late for you. Who knows.
Before you leave, what was again the name of the youngest winner ever who in 2014 got the prize for “[her] struggle against the suppression of children and young people and for the right of all children to education”?
Task 11: Instructions
- Assign the name of the youngest winner of a Nobel Prize to youngest_winner. The first name will suffice.
If you want to know more
The Nobel Prize dataset is rich, and this Project just scratched the surface; there is much more to explore! After you have completed this Project you can download it and continue exploring on your own computer! To do that you will have to install Jupyter notebooks with support for R. Here are instructions for how to install the Jupyter Notebook interface and here are instructions for how to add support for R. Good luck!
# The name of the youngest winner of the Nobel Prize as of 2016
youngest_winner <- "Malala Yousafzai"
youngest_winner
[1] "Malala Yousafzai"
