1. The most Nobel of Prizes

The Nobel Prize is perhaps the world’s best-known scientific award. Besides the honor, prestige, and substantial prize money, the recipient also gets a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Every year it’s given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the Prize was very Eurocentric and male-focused, but nowadays it’s not biased in any way whatsoever. Surely. Right?

Well, we’re going to find out! The Nobel Foundation has made a dataset available of all prize winners from the start of the prize, in 1901, to 2016. Let’s load it in and take a look.

Task 1: Instructions

Load the required libraries and the Nobel Prize dataset.

  • Load the tidyverse library.
  • Use read_csv (not read.csv) to read in datasets/nobel.csv and save it into nobel.
  • Show the head of nobel, that is, the first couple of prize winners.

Make sure to use read_csv (with an underscore) to read in the data. The read.csv function, which is built into R, has a number of quirks that the newer read_csv function avoids. For example, in R versions before 4.0.0, read.csv converts text columns to factors by default.
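As a rough illustration of how read_csv behaves, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names toy_path and toy are invented for this example and are not part of the Nobel dataset.

```r
library(readr)

# A tiny, made-up CSV written to a temporary file (not the real dataset)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("year,full_name,birth_date",
             "1901,Laureate A,1845-03-27",
             "1903,Laureate B,1867-11-07"),
           toy_path)

# read_csv returns a tibble and guesses each column's type from the data
toy <- read_csv(toy_path)

class(toy$full_name)   # text stays as character
class(toy$birth_date)  # ISO 8601 dates are parsed as Date
```

Having birth_date arrive as a real date, rather than as text, is what makes date helpers such as year() usable later without extra conversion.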

Good to know

This Project assumes you have used the dplyr and ggplot2 packages and that you are familiar with the pipe operator (%>%). Before taking on this Project, we recommend that you have completed the course Introduction to the Tidyverse.

RStudio has created some very helpful cheat sheets, including two that will be helpful for this Project: Data Wrangling and Data Visualization with ggplot2. We recommend that you keep them open in a separate tab to make it easy to refer to them.

2. So, who gets the Nobel Prize?

Just looking at the first couple of prize winners, or Nobel laureates as they are also called, we already see a celebrity: Wilhelm Conrad Röntgen, the guy who discovered X-rays. And actually, we see that all of the winners in 1901 were guys that came from Europe. But that was back in 1901. Looking at all winners in the dataset, from 1901 to 2016, which sex and which country are the most commonly represented? (For country, we will use the birth_country of the winner, as the organization_country is NA for all shared Nobel Prizes.)

Task 2: Instructions

Count up the Nobel Prizes. Also, split by sex and birth_country.

  • Count and display the number of rows/prizes using the count() function.
  • Count and display the number of rows/prizes, grouped by sex.
  • Count the number of rows/prizes, grouped by birth_country. Arrange the result by no. prizes in descending order and display the first 20 rows using head(20). For how to use the group_by function to group by a column check out Group Cases in the dplyr cheat sheet. For how to arrange rows, take a look at Arrange Cases in the same cheat sheet.
# Counting the number of (possibly shared) Nobel Prizes handed
# out between 1901 and 2016
nobel %>% count()
# Counting the number of prizes won by male and female recipients.
nobel %>% 
    count(sex)
# Counting the number of prizes won by birth country.
nobel %>%
    group_by(birth_country) %>%
    count() %>%
    arrange(desc(n)) %>%
    head(20)
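The grouped counts above can be tried out on a small table first. This is a minimal sketch on made-up rows (the tibble toy and its values are hypothetical). It also uses the sort argument of count(), which is a shortcut for arranging by n in descending order.

```r
library(dplyr)

# Made-up rows standing in for the nobel table
toy <- tibble(
    full_name     = c("A", "B", "C", "D", "E"),
    birth_country = c("Sweden", "France", "Sweden", NA, "Sweden")
)

# count(x) groups by x and counts the rows in each group;
# sort = TRUE puts the largest n first
country_counts <- toy %>%
    count(birth_country, sort = TRUE)

country_counts
```

Note that a missing birth_country is counted as its own group, so an NA row can show up in the result.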

3. USA dominance

Not so surprising perhaps: the most common Nobel laureate between 1901 and 2016 was a man born in the United States of America. But in 1901 all the laureates were European. When did the USA start to dominate the Nobel Prize charts?

Task 3: Instructions

Calculate the proportion of USA born winners per decade starting from the nobel dataset and put the result into prop_usa_winners.

  • Add a usa_born_winner column to nobel, where the value is TRUE when birth_country is "United States of America".
  • Add a decade column to nobel showing the decade the prize was awarded (1953 should become 1950, for example).
  • Group by decade and use summarize to create the column proportion. proportion should contain the proportion of rows where usa_born_winner is TRUE for each decade.
  • Display / print out prop_usa_winners.

You can use mutate for the first two bullet points.

To calculate the proportion of TRUE values you can use the mean function. If the column includes NA values, you would have to use the na.rm argument. Here’s how you could use mean together with summarize:

my_data %>% group_by(my_categorical_variable) %>% summarize(proportion = mean(is_winner, na.rm = TRUE))
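To see why mean gives a proportion, here is a small sketch on made-up data. The tibble toy and its column is_winner are hypothetical. The decade is computed by rounding the year down to the nearest multiple of 10.

```r
library(dplyr)

# Made-up data: is_winner includes one missing value
toy <- tibble(
    year      = c(1953, 1957, 1959, 1961, 1968),
    is_winner = c(TRUE, FALSE, NA, TRUE, TRUE)
)

toy_props <- toy %>%
    mutate(decade = floor(year / 10) * 10) %>%   # 1953 becomes 1950
    group_by(decade) %>%
    # TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE;
    # na.rm = TRUE drops the missing value before averaging
    summarize(proportion = mean(is_winner, na.rm = TRUE))

toy_props
```

Without na.rm = TRUE, the proportion for the 1950s would be NA, because one of its three values is missing.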

# Calculating the proportion of USA born winners per decade
prop_usa_winners <- nobel %>%
    mutate(usa_born_winner = birth_country == "United States of America",
           decade = floor(year / 10) * 10) %>%
    group_by(decade) %>%
    summarize(proportion = mean(usa_born_winner, na.rm = TRUE))
# Display the proportions of USA born winners per decade
prop_usa_winners

4. USA dominance, visualized

A table is OK, but to see when the USA started to dominate the Nobel charts we need a plot!

Task 4: Instructions

Plot the proportion of USA born winners per decade.

  • Use ggplot to plot prop_usa_winners with decade on the x-axis and proportion on the y-axis as a line-and-dot-plot. That is, add both geom_line() and geom_point().
  • Fix the y-scale so that it shows percentages, its limits go from 0.0 to 1.0, and extra spacing is removed above and below 0.0 and 1.0.

To change the y-axis use scale_y_continuous and set the labels, limits, and expand arguments. Check the ggplot2 documentation for how to use limits and expand. Here is a StackOverflow question that shows how to set labels correctly.
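The three arguments can be sketched on made-up data as follows. The data frame toy_props and its values are hypothetical, and the example assumes the scales package is installed, which is normally the case alongside ggplot2.

```r
library(ggplot2)

# Made-up proportions, one per decade
toy_props <- data.frame(
    decade     = c(1950, 1960, 1970),
    proportion = c(0.25, 0.40, 0.55)
)

p <- ggplot(toy_props, aes(x = decade, y = proportion)) +
    geom_line() +
    geom_point() +
    scale_y_continuous(
        labels = scales::percent,  # format axis labels as percentages
        limits = c(0, 1),          # axis runs from 0% to 100%
        expand = c(0, 0)           # no padding beyond the limits
    )

p
```

Fixing the limits to the full 0 to 1 range keeps the scale honest: a rise from 25% to 55% looks like what it is, rather than filling the whole panel.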

# Setting the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=4)
# Plotting USA born winners
ggplot(data = prop_usa_winners, aes(x = decade, y = proportion)) +
    geom_line() +
    geom_point() +
    scale_y_continuous(labels = scales::percent,
                       limits = c(0, 1), expand = c(0, 0))

5. What is the gender of a typical Nobel Prize winner?

So the USA first became the dominant winner of the Nobel Prize in the 1930s and has kept the leading position ever since. But one group that was in the lead from the start, and never seems to let go, is men. Maybe it shouldn’t come as a shock that there is some imbalance between how many male and female prize winners there are, but how significant is this imbalance? And is it better or worse within specific prize categories like physics, medicine, literature, etc.?

Task 5: Instructions

Plot the proportion of female laureates by decade split by prize category.

  • Add the female_winner column, where the value is TRUE when sex is "Female".
  • Add the column decade showing the decade the prize was awarded (1953 should become 1950, for example).
  • Group by decade and category and summarize the proportion of female_winner into the proportion column.
  • Copy and paste your ggplot code from task 4, except plot the prop_female_winners data and map the category variable to the color parameter.

This task can be solved by copying and modifying the code from tasks 3 and 4.

# Calculating the proportion of female laureates per decade
prop_female_winners <- nobel %>%
        mutate(female_winner = sex == "Female") %>% 
        mutate(decade = floor(year/10)*10) %>% 
        group_by(decade, category) %>% 
        summarize(proportion = mean(female_winner, na.rm = TRUE))
    
# Plotting the proportion of female laureates per decade
ggplot(data = prop_female_winners, aes(x = decade, y = proportion, color = category)) +
    geom_line() +
    geom_point() +
    scale_y_continuous(labels = scales::percent,
                       limits = c(0, 1), expand = c(0, 0))

6. The first woman to win the Nobel Prize

The plot above is a bit messy as the lines are overplotting. But it does show some interesting trends and patterns. Overall the imbalance is pretty large with physics, economics, and chemistry having the largest imbalance. Medicine has a somewhat positive trend, and since the 1990s the literature prize is also now more balanced. The big outlier is the peace prize during the 2010s, but keep in mind that this just covers the years 2010 to 2016.

Given this imbalance, who was the first woman to receive a Nobel Prize? And in what category?

Task 6: Instructions

Extract and display the row showing the first woman to win a Nobel Prize.

  • Use filter to filter away all non-"Female" laureates.
  • Use top_n to pick out the row with the earliest year.

top_n(x, n, wt) is a useful function that takes a table x and picks out the top n rows as ordered by the column wt. By default top_n sorts highest-to-lowest, so to pick out the five best offers in the bargain bin, you would have to use desc():

bargain_bin %>% top_n(5, desc(price))
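Here is a small runnable version of that idea, using a made-up bargain_bin table (the items and prices are hypothetical). It also shows slice_min(), which newer dplyr versions (1.0.0 and later) offer as a more direct way to ask for the smallest values.

```r
library(dplyr)

# A made-up bargain bin
bargain_bin <- tibble(
    item  = c("mug", "lamp", "book", "sock"),
    price = c(4, 12, 7, 2)
)

# top_n keeps the rows with the largest values of wt, so wrapping
# price in desc() keeps the two cheapest items instead
cheapest <- bargain_bin %>% top_n(2, desc(price))

# slice_min() expresses the same request directly (dplyr >= 1.0.0)
cheapest_alt <- bargain_bin %>% slice_min(price, n = 2)

cheapest
cheapest_alt
```

Keep in mind that top_n only filters rows and does not reorder them, and that it returns more than n rows when there are ties.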

nobel %>%
    filter(sex == "Female") %>%
    top_n(1, desc(year))

7. Repeat laureates

For most scientists/writers/activists a Nobel Prize would be the crowning achievement of a long career. But for some people, one is just not enough, and there are a few who have gotten it more than once. Who are these lucky few? (Having won no Nobel Prize myself, I’ll assume it’s just about luck.)

Task 7: Instructions

Extract and display the names of repeat Nobel Prize winners.

  • Use count to count the number of wins grouped by full_name.
  • Filter away all winners that “only” won one time.
# Selecting the laureates that have received 2 or more prizes.
nobel %>%
    group_by(full_name) %>%
    count() %>%
    filter(n > 1) %>%
    arrange(desc(n))

8. How old are you when you get the prize?

The list of repeat winners contains some illustrious names! We again meet Marie Curie, who got the prize in physics for her research on radiation and in chemistry for discovering radium and polonium. John Bardeen got it twice in physics for transistors and superconductivity, Frederick Sanger got it twice in chemistry, and Linus Carl Pauling got it first in chemistry and later in peace for his work in promoting nuclear disarmament. We also learn that organizations can get the prize, as both the Red Cross and the UNHCR have gotten it twice.

But how old are you generally when you get the prize?

Task 8: Instructions

Calculate and plot the age of each winner when they won their Nobel Prize.

  • Load the lubridate package (you’ll find the year() function useful).
  • mutate the nobel table to include the column age, which should be how old people were when they got their prize. Assign the resulting table to nobel_age.
  • Use ggplot to plot age as a function of year as a scatter plot (geom_point()) with a smooth trend (geom_smooth()).

The year() function from lubridate takes a date and extracts the year:

dates <- as.Date(c("1985-04-02", "1988-07-25"))
year(dates)
## [1] 1985 1988
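Building on that, the age calculation can be sketched as a simple difference of years. The prize years and birth dates below are made up for illustration.

```r
library(lubridate)

# Made-up prize years and birth dates
prize_year <- c(1985, 2014)
birth_date <- as.Date(c("1930-04-02", "1997-07-12"))

# year() pulls out the calendar year, so the difference is the age
# the person reaches during the prize year
age <- prize_year - year(birth_date)
age
```

Because only the years are compared, the result can be one year higher than the person’s exact age on the day of the award. For looking at broad trends that is close enough.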

# Loading the lubridate package
library(lubridate)
package ‘lubridate’ was built under R version 3.5.1
Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date
# Calculating the age of Nobel Prize winners
nobel_age <- nobel %>%
    mutate(age = year - year(birth_date))
# Plotting the age of Nobel Prize winners
ggplot(data = nobel_age, aes(x = year, y = age))+
    geom_point()+
    geom_smooth()

9. Age differences between prize categories

The plot above shows us a lot! We see that people used to be around 55 when they received the prize, but nowadays the average is closer to 65. But there is a large spread in the laureates’ ages, and while most are 50+, some are very young.

We also see that the density of points is much higher nowadays than in the early 1900s: many more of the prizes are now shared, and so there are many more winners. We also see that there was a disruption in awarded prizes around the Second World War (1939 - 1945).

Let’s look at age trends within different prize categories.

Task 9: Instructions

Plot how old winners are within the different prize categories.

  • Use ggplot to plot age as a function of year as a scatter plot (geom_point()) with a smooth trend (geom_smooth()) and facet by category using facet_wrap.
  • Optional: Remove the confidence band in geom_smooth by setting se = FALSE.

This is the same plot as in task 8, except faceted by category.

Removing the confidence band in geom_smooth is not strictly necessary, but the bands are not that meaningful and removing them makes the plot more focused.

If you don’t remember how facet_wrap works then look under Faceting on the second page of the ggplot2 cheat sheet.
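As a reminder of how faceting behaves, here is a minimal sketch on made-up data. The data frame toy_age and its two categories are hypothetical. It uses method = "lm" for the trend line, which is an assumption made to keep this tiny example quiet; leaving method unset lets geom_smooth choose a smoother itself.

```r
library(ggplot2)

# Made-up ages for two hypothetical categories
toy_age <- data.frame(
    year     = rep(seq(1950, 2010, by = 10), times = 2),
    age      = c(52, 55, 58, 60, 63, 66, 68,
                 61, 60, 62, 59, 61, 60, 62),
    category = rep(c("Category A", "Category B"), each = 7)
)

p <- ggplot(toy_age, aes(x = year, y = age)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +  # straight-line trend, no band
    facet_wrap(~ category)                    # one panel per category

p
```

Each panel gets its own points and its own trend line, while the axes are shared by default, which makes the panels easy to compare.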

# Same plot as above, but faceted by the category of the Nobel Prize
ggplot(data = nobel_age, aes(x = year, y = age))+
    geom_point()+
    geom_smooth(se = FALSE)+
    facet_wrap(~ category)

10. Oldest and youngest winners

Another plot with lots of exciting stuff going on! We see that winners of the chemistry, medicine, and physics prizes have all gotten older over time. The trend is strongest for physics: the average age used to be below 50, and now it’s almost 70. Literature and economics are more stable, and we also see that economics is a newer category. But peace shows an opposite trend where winners are getting younger!

In the peace category we also see a winner around 2010 who seems exceptionally young. This raises the question: who are the oldest and youngest people ever to have won a Nobel Prize?

Task 10: Instructions

Pick out the rows of the oldest and the youngest winner of a Nobel Prize.

  • Use top_n to pick out and display the row of the oldest winner.
  • Use top_n to pick out and display the row of the youngest winner. Remember that you can use desc to reverse the sorting:

The most expensive item

bargain_bin %>% top_n(1, price)

The cheapest item

bargain_bin %>% top_n(1, desc(price))

# The oldest winner of a Nobel Prize as of 2016
nobel_age %>% top_n(1, age)
# The youngest winner of a Nobel Prize as of 2016
nobel_age %>% top_n(1, desc(age))

11. You get a prize!

Hey! You get a prize for making it to the very end of this notebook! It might not be a Nobel Prize, but I made it myself in paint so it should count for something. But don’t despair, Leonid Hurwicz was 90 years old when he got his prize, so it might not be too late for you. Who knows.

Before you leave, what was the name, again, of the youngest winner ever, who in 2014 got the prize for “[her] struggle against the suppression of children and young people and for the right of all children to education”?

Task 11: Instructions

  • Assign the name of the youngest winner of a Nobel Prize to youngest_winner. The first name will suffice.

If you want to know more

The Nobel Prize dataset is rich, and this Project just scratched the surface. There is much more to explore! After you have completed this Project you can download it and continue exploring on your own computer! To do that you will have to install Jupyter notebooks with support for R. Here are instructions for how to install the Jupyter Notebook interface and here are instructions for how to add support for R. Good luck!

# The name of the youngest winner of the Nobel Prize as of 2016
youngest_winner <- "Malala Yousafzai"
youngest_winner
[1] "Malala Yousafzai"
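Instead of typing the name by hand, it could also be pulled out of the table. This is a minimal sketch on a made-up stand-in for nobel_age; the names and ages in toy_age are hypothetical.

```r
library(dplyr)

# Made-up stand-in for nobel_age; names and ages are hypothetical
toy_age <- tibble(
    full_name = c("Laureate A", "Laureate B", "Laureate C"),
    age       = c(61, 25, 88)
)

# Keep the row with the smallest age, then extract the name column
youngest_toy <- toy_age %>%
    top_n(1, desc(age)) %>%
    pull(full_name)

youngest_toy
```

If several laureates shared the lowest age, this would return more than one name, since top_n keeps ties.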
