In this project, I use a dataset that includes information about every nobel prize laureate between 1901 (when the prize was first awarded) and 2016. This includes awards in physics, chemistry, literature, medicine, economics, and peace. From this data, I seek to dissect the demographics of this honored populace. Is there gender parity? Given that Malala Yousafzai recently won at the young age of 17, is this a sign of laureate ages skewing younger? How “westernized” is the award based on laureate nationalities? Finally, I add in another dataset that compiled publication records for chemistry prize winners to analyze the delay between publications and nominations.

First and foremost, we need to install and load the necessary packages:

install.packages(c("tidyverse", "janitor", "lubridate", "ggplot2","tmap","sf","rnaturalearth","rnaturalearthdata"))

library(tidyverse)
library(janitor)
library(lubridate)
library(ggplot2)
library(tmap)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)

Next, we need to bring in our dataset and clean its column headers:

nobelwinners <- read_csv('nobel_laureates.csv') %>% janitor::clean_names()
view(nobelwinners)

Part 1: Gender

QUESTION: How balanced (or lopsided) are nobel laureates on the basis of sex. Has this changed over time?

GOAL: Create a population pyramid of laureates over time differentiated by sex

It should be noted here that some laureates are not individuals but in fact organizations. For the purposes of this analysis, we need to isolate key variables and remove those values (“NA” in our sex column)

CleanSex <- nobelwinners %>%
  select(year, sex) %>%
  na.omit(sex, cols=c(!"sex"))

Now we need to group by decade and tally up the number of males and females that won within each time period. “.drop” is used to ensure zero values are also included.

YearsGrouped <- mutate(
  CleanSex,
  decade=cut(
    year,
    breaks=c(1900,1909,1919,1929,1939,1949,1959,1969,1979,1989,1999,2009,2020),
    labels=c("1901-1909","1910-1919","1920-1929","1930-1939","1940-1949","1950-1959","1960-1969","1970-1979","1980-1989","1990-1999","2000-2009","2010-2016")
  )
)

SexGrouped <-
  YearsGrouped %>%
  count(sex,decade,name="sex_by_decade",.drop=FALSE)

We’re almost there, but for a population pyramid, we need the data for males to be negative!

PPData <- mutate(SexGrouped, value=if_else(SexGrouped$sex == "Male",  -SexGrouped$sex_by_decade, SexGrouped$sex_by_decade))

Now, we can finally plot the data *Note: this code was adapted from a much more complicated version found on stack overflow

SexPlot <- ggplot(PPData, aes(x = decade, y = value, fill = sex)) +
  geom_bar(stat = "identity") +
  labs(title = "Gender Distribution of Nobel Laureates Across the Decades",
       subtitle = "No matter the decade, it is abundantly clear that Nobel laureates are
predominantly men. That being said, we can see a slight uptick in female
laureates over time (although this may simply be a product of more awards
being given out over time).",
       x = "Decade",
       y = "Number of Laureates",
       caption = "Source: Nobel Foundation at Kaggle") +
  coord_flip() +
  theme_bw()

Part 2: Age

QUESTION: Over time, has there been any variation in the age at which nobel laureates win their award?

GOAL: Create a scatterplot of laureate age over time with added loess curve to identify trend

First, separate the year of birth for each winner and subtract birth year from prize year to find age

birthYear <- mutate(nobelwinners, birthyear=year(birth_date))
Age <- mutate(birthYear, Age=year-birthyear)

Next, isolate key variables and clean the data to remove “NA” values (essentially removing organizations which obviously don’t have an age)

CleanAge <- Age %>%
  select(year, sex, Age) %>%
  na.omit(Age, cols=c(!"Age"))

Finally, plot the data (colors will be altered when exported into illustrator)

AgePlot <- ggplot(CleanAge, aes(x=year, y=Age, color=sex)) +
  geom_point() +
  ylim(c(0,100)) +
  geom_smooth(method="loess", se=F) +
  theme_bw() +
  labs(title="Changing Age Demography of Nobel Prize Laureates",
       subtitle="Over time, laureate age has been trending upward such that today's winners, on
average tend to be much older. Female laureates, of which there are notably very
few, have seen a reverse trend, likely affected by outlier Malala Yousafzai
being awarded in 2014 at the young age of 17.",
       y="Age",
       x="Year Awarded",
       caption="Source: Nobel Foundation at Kaggle")

Part 3: Nationality

QUESTION: In what country were the majority of laureates born?

GOAL: Create a map showing the number of laureates by country

First, isolate the key column of birth country and omit “NA” values

CleanNat <- nobelwinners %>%
  select(birth_country) %>%
  na.omit(birth_country, cols=c(!"birth_country"))

Now we find in the data that countries have changed over time (countries like Prussia and the Russian empire have been broken into many different countries today). The modern country is represented in parentheses which we need to isolate in order to do our analysis. This required some deep dives into regexp in r and some help from stack overflow.

Essentially, the function looks at the string in each row and if it contains parentheses, it extracts the text inside the parentheses, but if not it simply returns the original string

ModernNat <- mutate(
  CleanNat, country=ifelse(str_detect(birth_country,"\\("),
      gsub("[\\(\\)]", "",
      regmatches(birth_country, gregexpr("\\(.*?\\)",
      birth_country))),birth_country)
)

Now that we have clean country names, we can group them and tally

NatGrouped <- ModernNat %>%
  group_by(country) %>%
  tally(sort=TRUE)

And now we can join our data to world data

world <- ne_countries(scale="medium",returnclass="sf")
WorldData <- left_join(world, NatGrouped, by = c("sovereignt"="country"))

Finally, we can create a thematic map with this data (in this case I use Eckert IV projection and Jenks classification)

tm_shape(WorldData, projection = "eck4") + 
  tm_polygons("n", style = "jenks", palette = "OrRd", title = "Laureate Count") +
  tm_credits("Source: Nobel Foundation at Kaggle, tmap World Data") +
  tm_style_white(legend.frame = TRUE) +
  tm_layout(main.title = "Geographic Distribution of Nobel Laureates")

Part 4: Publication History

QUESTION: Looking just at Chemistry Prize Laureates, how long is the gap between publication and nomination? Has this gap changed over time?

GOAL: Create a scatterplot with a loess curve showing the delay over time

First, we need to bring in and clean the new dataset

chempubs <- read_csv('Chemistry_publications.csv') %>% janitor::clean_names()
view(chempubs)

Next, we can filter out just the award-winning papers. Unfortunately, some laureates win on multiple papers. Thus, for this analysis, I filtered by the earliest winning paper for each laureate. We can then isolate our key variables and subtract our publication year from the prize year to find the delay.

WinningPubs <- chempubs %>%
  filter(is_prize_winning_paper=="YES")
EarliestWonPub <- WinningPubs %>%
  group_by(laureate_name) %>%
  slice(which.min(pub_year))
CleanPubs <- EarliestWonPub %>%
  select(laureate_name,prize_year,pub_year)
Delay <- mutate(
  CleanPubs,
  delay=prize_year-pub_year
)

Finally, we can plot this in a scatterplot

DelayPlot <- ggplot(Delay, aes(x=prize_year, y=delay)) +
  geom_point() +
  ylim(c(0,50)) +
  geom_smooth(method="loess") +
  theme_bw() +
  labs(title="Delay Between an Award-Worthy Publication and the Award",
       subtitle="Over time, at least for Chemistry Prize Winners, there is a growing gap between when a laureate
publishes research and when they actually receive a reward for said research. Laureates today can
expect an average of 28 years between publication and nomination.",
       y="Delay (Years)",
       x="Year Awarded",
       caption="Source: Harvard Dataverse")